Discriminative  Learning  with  Markov  Logic  Networks 


Tuyen  N.  Huynh 

Department of Computer Sciences
University of Texas at Austin
Austin, TX 78712
hntuyen@cs.utexas.edu

Doctoral Dissertation Proposal

Supervising  Professor:  Raymond  J.  Mooney 


Abstract 

Statistical relational learning (SRL) is an emerging area of research that addresses the problem of learning from noisy structured/relational data. Markov logic networks (MLNs), sets of weighted clauses, are a simple but powerful SRL formalism that combines the expressivity of first-order logic with the flexibility of probabilistic reasoning. Most of the existing learning algorithms for MLNs are in the generative setting; they try to learn a model that maximizes the likelihood of the training data. However, most of the learning problems in relational data are discriminative. To exploit the power of MLNs, we therefore need discriminative learning methods that match these discriminative tasks.

In this proposal, we present two new discriminative learning algorithms for MLNs. The first one is a discriminative structure and weight learner for MLNs with non-recursive clauses. We use a variant of Aleph, an off-the-shelf Inductive Logic Programming (ILP) system, to learn a large set of Horn clauses from the training data, then we apply an L1-regularization weight learner to select a small set of non-zero weight clauses that maximizes the conditional log-likelihood (CLL) of the training data. The experimental results show that our proposed algorithm outperforms existing learning methods for MLNs and traditional ILP systems in terms of predictive accuracy, and its performance is comparable to state-of-the-art results on some ILP benchmarks. The second algorithm we present is a max-margin weight learner for MLNs. Instead of maximizing the CLL of the data like all existing discriminative weight learners for MLNs, the new weight learner tries to maximize the ratio between the probability of the correct label (the observable data) and that of the closest incorrect label (the wrong label with the highest probability), which can be formulated as an optimization problem called "1-slack" structural SVM. This optimization problem can be solved by an efficient algorithm based on the cutting plane method. However, this cutting plane algorithm requires an efficient inference method as a subroutine. Unfortunately, exact inference in MLNs is intractable. So we develop a new approximate inference method for MLNs based on Linear Programming relaxation. Extensive experiments in two real-world MLN applications demonstrate that the proposed max-margin weight learner generally achieves higher F1 scores than the current best discriminative weight learner for MLNs.

For future work, our short-term goal is to develop a more efficient inference algorithm and test our max-margin weight learner on more complex problems where there are complicated relationships between the input and output variables and among the outputs. In the longer term, our plan is to develop more efficient learning algorithms through online learning, and algorithms that revise both the clauses and their weights to improve predictive performance.


Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188)

Report date: 2009
Dates covered: 00-00-2009 to 00-00-2009
Title and subtitle: Discriminative Learning with Markov Logic Networks
Performing organization: University of Texas at Austin, Department of Computer Sciences, Austin, TX, 78712
Distribution/availability statement: Approved for public release; distribution unlimited
Abstract: see report
Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 36

Contents

1  Introduction
2  Background
    2.1  MLNs and Alchemy
    2.2  ILP and Aleph
    2.3  Structural Support Vector Machines
3  Discriminative structure and weight learning for MLNs with non-recursive clauses
    3.1  Discriminative Structure Learning
    3.2  Discriminative Weight Learning
    3.3  Experimental Evaluation
        3.3.1  Data
        3.3.2  Methodology
        3.3.3  Results and Discussion
    3.4  Related Work
    3.5  Summary
4  Max-Margin Weight Learning for MLNs
    4.1  Max-Margin Formulation
    4.2  Approximate MPE inference for MLNs
    4.3  Approximation algorithm for the separation oracle
    4.4  Experimental Evaluation
        4.4.1  Datasets
        4.4.2  Methodology
        4.4.3  Results and Discussion
    4.5  Related Work
    4.6  Summary
5  Proposed Research
    5.1  Improving the predictive performance
        5.1.1  Revising MLNs
        5.1.2  Optimizing non-linear performance metrics
    5.2  More efficient learning algorithm
        5.2.1  Online learning
        5.2.2  Efficient MPE and loss-augmented MPE inference algorithms
    5.3  Experiments on additional problems
6  Conclusions
References


1  Introduction 


Much real-world data is relational or structured, for example graphs and multi-relational data. Such data contain many entities (or objects) and relationships among the entities. For example, biochemical data contain information about various atoms and their interactions, social network data contain information about people and relationships between them, and so on. Moreover, these data are full of uncertainty: uncertainty about the attributes of an object, the type of an object, and the relationships between objects. Statistical relational learning (SRL) (Getoor & Taskar, 2007), which combines ideas from rich knowledge representations, such as first-order logic, with those from probabilistic graphical models, is an emerging area of research that addresses the problem of learning from these noisy structured/relational data.

A variety of different SRL models have been proposed over the last ten years. Among them, Markov Logic Networks (MLNs) (Richardson & Domingos, 2006), which are sets of weighted first-order clauses, are a simple but powerful formalism that generalizes both first-order logic and Markov networks. MLNs are capable of representing all possible probability distributions over a finite number of objects (Richardson & Domingos, 2006). Moreover, MLNs also subsume other SRL representations such as probabilistic relational models (Koller & Pfeffer, 1998) and relational Markov networks (Taskar, Abbeel, & Koller, 2002). Therefore, in this work, we have chosen MLNs as the model for our research.

Most of the existing learning algorithms for MLNs are in the generative setting: they try to learn a model that maximizes the likelihood of the training data. However, most of the learning problems in relational data are discriminative. For example, in the biochemical data, the goal is to learn a model that discriminates the active chemical compounds from the inactive ones based on their molecular structures. This problem is called structure activity relationship (SAR) prediction, and it is an important task in drug design and discovery (King, Sternberg, & Srinivasan, 1995). Another example is collective web-page classification. Given a set of web-pages of a department, the task is to simultaneously classify these web-pages into some pre-defined categories based on their content and the hyperlinks between them (Slattery & Craven, 1998). It is, therefore, an important research problem to develop discriminative learning algorithms for MLNs that improve their predictive performance on these discriminative tasks.

In this proposal, we present two new discriminative learning algorithms for MLNs. The first one is a discriminative structure and weight learner for MLNs with non-recursive clauses (Huynh & Mooney, 2008). We use a variant of Aleph (Srinivasan, 2001), an off-the-shelf Inductive Logic Programming (ILP) system, to learn a large set of Horn clauses from the training data, then we apply an L1-regularization weight learner to select a small set of non-zero weight clauses that maximizes the conditional log-likelihood (CLL) of the training data. The experimental results show that our proposed algorithm outperforms existing learning methods for MLNs and traditional ILP systems in terms of predictive accuracy, and its performance is comparable to state-of-the-art results on some ILP benchmarks. The second algorithm we present is a max-margin weight learner for MLNs (Huynh & Mooney, 2009). Instead of maximizing the CLL of the data like all existing discriminative weight learners for MLNs, the new weight learner tries to maximize the ratio between the probability of the correct label (the observable data) and that of the closest incorrect label (the wrong label with the highest probability), which can be formulated as an optimization problem called "1-slack" structural SVM (Joachims, Finley, & Yu, 2009). Joachims et al. (2009) present an efficient algorithm for solving this optimization problem based on the cutting plane method. However, this cutting plane algorithm requires an efficient inference method as a subroutine. Unfortunately, exact inference in MLNs is intractable. So we develop a new approximate inference method for MLNs based on Linear Programming relaxation. One advantage of the max-margin weight learner is that it can be adapted to maximize a variety of performance metrics in addition to classification accuracy (Joachims, 2005). Extensive experiments in two real-world MLN applications demonstrate that the proposed max-margin weight learner generally achieves higher F1 scores than the current best discriminative weight learner for MLNs.

For future work, our short-term goal is to develop a more efficient inference algorithm and test our max-margin weight learner on more complex problems where there are complicated relationships between the input and output variables and among the outputs. In the longer term, we plan to work on the following problems:

•  Improving the predictive performance

All of the current discriminative weight learners assume that the structure (the clauses) is correct, and only try to fix the weights. However, it is possible that the input structure has some errors that cannot be fixed by only modifying the weights. Hence, we plan to develop new algorithms that try to revise both the clauses and their weights at the same time. In addition, we want to extend the max-margin weight learner to optimize other non-linear performance metrics.

•  More efficient learning

Most of the existing learning algorithms for MLNs are in the batch setting. However, there are many cases where this approach becomes computationally expensive, especially when the number of training examples is huge. One efficient alternative is online learning, which processes the training examples sequentially. We first plan to adapt some of the existing online max-margin weight learning algorithms to the case of MLNs. Then we plan to look at the problem of online structure learning and revision.


2  Background 

2.1  MLNs  and  Alchemy 

An MLN consists of a set of weighted first-order clauses. It provides a way of softening first-order logic by making situations in which not all clauses are satisfied less likely but not impossible (Richardson & Domingos, 2006). More formally, let $X$ be the set of all propositions describing a world (i.e. the set of all ground atoms), $F$ be the set of all clauses in the MLN, $w_i$ be the weight associated with clause $f_i \in F$, $G_{f_i}$ be the set of all possible groundings of clause $f_i$, and $Z$ be the normalization constant. Then the probability of a particular truth assignment $x$ to the variables in $X$ is defined as (Richardson & Domingos, 2006):

\[
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{f_i \in F} w_i \sum_{g \in G_{f_i}} g(x) \Big) = \frac{1}{Z} \exp\Big( \sum_{f_i \in F} w_i \, n_i(x) \Big) \tag{1}
\]

where $g(x)$ is 1 if $g$ is satisfied and 0 otherwise, and $n_i(x) = \sum_{g \in G_{f_i}} g(x)$ is the number of groundings of $f_i$ that are satisfied given the current truth assignment to the variables in $X$.
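As a minimal sketch of Equation 1, the following Python fragment computes $P(X = x)$ by brute-force enumeration over a tiny ground network. The atoms, the single clause Smokes(x) => Cancer(x), its weight, and all function names are hypothetical and chosen only for illustration; enumeration is feasible only for a handful of ground atoms.

```python
from itertools import product
from math import exp

# Hypothetical toy domain with two constants, A and B.
ATOMS = ["Smokes(A)", "Smokes(B)", "Cancer(A)", "Cancer(B)"]

# Each first-order clause is (weight, list of groundings).
# A grounding is a disjunction of (atom, required truth value) literals.
# Here: Smokes(x) => Cancer(x), i.e. !Smokes(x) v Cancer(x), with weight 1.5.
CLAUSES = [
    (1.5, [
        [("Smokes(A)", False), ("Cancer(A)", True)],
        [("Smokes(B)", False), ("Cancer(B)", True)],
    ]),
]

def n_satisfied(groundings, world):
    """n_i(x): number of groundings of one clause satisfied in this world."""
    return sum(any(world[atom] == value for atom, value in g) for g in groundings)

def log_potential(world):
    """sum_i w_i * n_i(x), the exponent in Equation 1."""
    return sum(w * n_satisfied(gs, world) for w, gs in CLAUSES)

def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def probability(world):
    """P(X = x) = exp(sum_i w_i n_i(x)) / Z, with Z computed by enumeration."""
    z = sum(exp(log_potential(x)) for x in all_worlds())
    return exp(log_potential(world)) / z
```

In this toy network a world that violates both groundings is less likely than a world that satisfies both by a factor of exp(3.0), but it still has non-zero probability, which is the softening effect described above.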

There are two main inference tasks in MLNs. The first one is to infer the Most Probable Explanation (MPE), or the most probable truth values for a set of unknown literals $y$ given a set of known literals $x$ provided as evidence (also called MAP inference). This task is formally defined as follows:

\[
\arg\max_{y} P(y \mid x) = \arg\max_{y} \frac{1}{Z_x} \exp\Big( \sum_{i} w_i \, n_i(x, y) \Big) = \arg\max_{y} \sum_{i} w_i \, n_i(x, y) \tag{2}
\]

where $Z_x$ is the normalization constant over all possible worlds consistent with $x$, and $n_i(x, y)$ is the number of true groundings of clause $f_i$ given the truth assignment $(x, y)$. MPE inference in MLNs is therefore equivalent to finding the truth assignment that maximizes the sum of the weights of satisfied clauses, a Weighted MAX-SAT problem. This is an NP-hard problem for which a number of approximate solvers exist, of which the most commonly used is MaxWalkSAT (Kautz, Selman, & Jiang, 1997). Recently, Riedel (2008) proposed a more efficient method to solve the MPE inference problem called Cutting Plane Inference (CPI), which does not require grounding the whole MLN. CPI is a meta inference algorithm that incrementally constructs some parts of a large and complex Markov network and then uses some MPE inference algorithm to find the MPE solution on the constructed network. The main idea is that we do not need to ground the whole Markov network to find the MPE solution since there is a lot of redundant information in the whole network. However, the CPI method only works well when the separation step returns a small set of constraints. In the worst case, it also constructs the whole ground MLN.
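To make the reduction to Weighted MAX-SAT concrete, here is a minimal Python sketch that finds the MPE state of a tiny ground network by exhaustive search. It is not MaxWalkSAT or CPI; the function name `mpe`, the clause encoding, and the example atoms and weights are assumptions made for illustration, and exhaustive search is exponential in the number of query atoms.

```python
from itertools import product

def mpe(query_atoms, evidence, ground_clauses):
    """Exhaustive MPE: maximize the total weight of satisfied ground clauses.

    ground_clauses is a list of (weight, literals); each literal is an
    (atom, required truth value) pair and a clause is a disjunction.
    """
    best_y, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(query_atoms)):
        world = dict(evidence)
        world.update(zip(query_atoms, values))
        score = sum(w for w, lits in ground_clauses
                    if any(world[a] == v for a, v in lits))
        if score > best_score:
            best_y, best_score = dict(zip(query_atoms, values)), score
    return best_y, best_score

# Hypothetical example: evidence x and query atoms y.
evidence = {"Smokes(A)": True, "Friends(A,B)": True}
query = ["Smokes(B)", "Cancer(A)", "Cancer(B)"]
clauses = [
    (1.5, [("Smokes(A)", False), ("Cancer(A)", True)]),   # Smokes(A) => Cancer(A)
    (1.5, [("Smokes(B)", False), ("Cancer(B)", True)]),   # Smokes(B) => Cancer(B)
    (1.1, [("Friends(A,B)", False), ("Smokes(A)", False),
           ("Smokes(B)", True)]),                          # friends smoke alike
    (0.5, [("Cancer(A)", False)]),                         # weak prior: no cancer
    (0.5, [("Cancer(B)", False)]),
]
y_star, weight = mpe(query, evidence, clauses)
```

Because the normalization constant $Z_x$ does not depend on $y$, the search only has to compare sums of satisfied clause weights, exactly as in Equation 2.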

The second inference task in MLNs is computing the conditional probabilities of some unknown query literals, $y$, given some evidence $x$. Computing these probabilities is also intractable, but there are good approximation algorithms such as MC-SAT (Poon & Domingos, 2006) and lifted belief propagation (Singla & Domingos, 2008).

Learning an MLN consists of two tasks: structure learning and weight learning. The weight learner can learn weights for clauses written by a human expert or automatically induced by a structure learner. There are two approaches to weight learning in MLNs: generative and discriminative. In discriminative learning, we know a priori which predicates will be used to supply evidence and which ones will be queried, and the goal is to correctly predict the latter given the former. Several discriminative weight learning methods have been proposed, all of which try to find weights that maximize the Conditional Log Likelihood (CLL) (equivalently, minimize the negative CLL). In MLNs, the derivative of the negative CLL with respect to a weight $w_i$ is the difference between the expected number of true groundings $E_w[n_i]$ of the corresponding clause $f_i$ and the actual number $n_i$ according to the data. However, computing the expected count $E_w[n_i]$ is intractable. The first discriminative weight learner (Singla & Domingos, 2005) uses the structured perceptron algorithm (Collins, 2002), approximating the intractable expected counts by the counts in the MPE state computed by MaxWalkSAT. Later, Lowd and Domingos (2007) presented a number of first-order and second-order methods for optimizing the CLL. These methods use samples from MC-SAT to approximate the expected counts used to compute the gradient and Hessian of the CLL. Among them, the best performing is preconditioned scaled conjugate gradient (PSCG) (Lowd & Domingos, 2007). This method uses the inverse diagonal Hessian as the preconditioner.
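The gradient described above, $E_w[n_i] - n_i$, can be computed exactly on a very small ground network by enumerating all assignments to the query atoms. The Python sketch below does this; it illustrates the quantity that the perceptron and MC-SAT based learners approximate and is not any of those learners. All names and the clause encoding are hypothetical.

```python
from itertools import product
from math import exp

def clause_counts(world, clauses):
    """n_i(x, y) for each first-order clause: its number of satisfied groundings."""
    return [sum(any(world[a] == v for a, v in g) for g in groundings)
            for groundings in clauses]

def neg_cll_gradient(weights, clauses, query_atoms, evidence, observed_y):
    """Exact gradient of the negative CLL: E_w[n_i] - n_i for every clause i.

    The expectation is taken over all assignments to the query atoms given
    the evidence, so the cost is exponential in len(query_atoms).
    """
    expected = [0.0] * len(weights)
    z = 0.0
    for values in product([False, True], repeat=len(query_atoms)):
        world = {**evidence, **dict(zip(query_atoms, values))}
        counts = clause_counts(world, clauses)
        p = exp(sum(w * c for w, c in zip(weights, counts)))  # unnormalized
        z += p
        expected = [e + p * c for e, c in zip(expected, counts)]
    expected = [e / z for e in expected]
    actual = clause_counts({**evidence, **observed_y}, clauses)
    return [e - a for e, a in zip(expected, actual)]
```

For a single unit clause on one query atom with weight 0 and an observed value of true, the expected count is 0.5 and the actual count is 1, so the gradient is -0.5 and a descent step on the negative CLL increases the weight.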

Regarding structure learning, there are currently two main approaches for learning clauses for MLNs. The first one is a top-down approach (Kok & Domingos, 2005; Biba, Ferilli, & Esposito, 2008). These algorithms can start from an empty network or from an existing knowledge base, so they can be used for learning a new MLN or revising an existing MLN. The algorithms usually start from the set of unit clauses, and iteratively add new clauses to the model. In each step, they try to find the best clause to add to the current MLN by adding, deleting, or flipping the sign of a literal (Kok & Domingos, 2005) or performing a stochastic local search (Biba et al., 2008). The weight of each candidate clause is set to optimize the weighted pseudo log-likelihood (WPLL) (Kok & Domingos, 2005) through an optimization procedure. Then each candidate structure is scored by the WPLL (Kok & Domingos, 2005) or by the CLL (Biba et al., 2008), and the best candidate clause is added to the learnt MLN. The other approach is the bottom-up one (Mihalkova & Mooney, 2007; Kok & Domingos, 2009). Mihalkova and Mooney (2007) proposed the first bottom-up structure learner for MLNs called BUSL. It first constructs Markov network templates from the data and then generates candidate clauses from these network templates. All candidate clauses are also evaluated using WPLL, and added to the final MLN in a greedy manner. Recently, Kok and Domingos (2009) proposed a new bottom-up structure learner for MLNs called LHL. The main idea of this algorithm is based on the observation that a relational database can be viewed as a hypergraph with constants as nodes and relations as hyperedges. Then a clause can be constructed from a path in the hypergraph. However, a hypergraph usually contains an exponential number of paths. So to make it tractable, the algorithm first lifts the hypergraph by jointly clustering all the constants in the relational database to form higher-level concepts, then finds paths in the lifted hypergraph.

Alchemy (Kok, Singla, Richardson, & Domingos, 2005) is an open source software package for MLNs. It includes implementations for all of the major existing algorithms for structure learning, generative weight learning, discriminative weight learning, and inference. Our proposed algorithms are implemented using Alchemy.

2.2  ILP  and  Aleph 

Traditional ILP systems discriminatively learn logical Horn-clause rules (logic programs) for inferring a given target predicate given information provided by a set of background predicates. These purely logical definitions are induced from Horn-clause background knowledge and a set of positive and negative tuples of the target predicate. For more information about ILP, please see Dzeroski (2007).

Aleph is a popular and effective ILP system primarily based on Progol (Muggleton, 1995). The basic Aleph algorithm consists of four steps. First, it selects a positive example to serve as the "seed" example. Then, it constructs the most specific clause, the "bottom clause", that entails that selected example. The bottom clause is formed by conjoining all known facts about the seed example. Next, Aleph finds generalizations of this bottom clause by performing a general-to-specific search. These generalized clauses are scored using a chosen evaluation metric, and the clause with the best score is added to the final theory. This process is repeated until it finds a set of clauses that covers all the positive examples. Aleph allows users to customize each of these steps, and thereby supports a variety of specific algorithms.
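The four steps above can be sketched as a greedy covering loop. The Python fragment below is a propositional stand-in, not Aleph itself: examples are sets of facts, a clause is a set of facts that must all hold, the bottom clause is the full fact set of the seed, and generalizations are its subsets enumerated from the most general (empty) to the most specific. The scoring function (covered uncovered positives minus covered negatives) and every name here are assumptions made for illustration.

```python
from itertools import combinations

def covers(clause, example):
    """A clause (set of required facts) covers an example that contains them all."""
    return clause <= example

def generalizations(bottom):
    """All subsets of the bottom clause, from most general to most specific."""
    facts = sorted(bottom)
    for size in range(len(facts) + 1):
        for subset in combinations(facts, size):
            yield frozenset(subset)

def score(clause, positives, negatives):
    """Hypothetical metric: positives covered minus negatives covered."""
    return (sum(covers(clause, e) for e in positives)
            - sum(covers(clause, e) for e in negatives))

def greedy_cover(positives, negatives):
    theory, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]                      # step 1: pick a seed example
        bottom = frozenset(seed)                 # step 2: build the bottom clause
        best = max(generalizations(bottom),      # steps 3-4: search and score
                   key=lambda c: score(c, uncovered, negatives))
        theory.append(best)
        remaining = [e for e in uncovered if not covers(best, e)]
        # Every subset of the bottom clause covers the seed, so progress is made.
        uncovered = remaining
    return theory

positives = [frozenset("ab"), frozenset("ac"), frozenset("d")]
negatives = [frozenset("bc"), frozenset("c")]
theory = greedy_cover(positives, negatives)
```

On this toy data the loop first learns the clause requiring fact "a", which covers two positives and no negatives, and then the clause requiring fact "d" for the remaining positive.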

2.3  Structural  Support  Vector  Machines 

In this section, we briefly review the structural SVM problem and an algorithmic schema for solving it efficiently. For more detail, see Tsochantaridis, Joachims, Hofmann, and Altun (2005) and Joachims et al. (2009). In structured output prediction, we want to learn a function $h : \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X}$ is the space of inputs and $\mathcal{Y}$ is the space of multivariate and structured outputs, from a set of training examples $S$:

\[
S = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n
\]

The goal is to find a function $h$ that has low prediction error. This can be accomplished by learning a discriminant function $f : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, then maximizing $f$ over all $y \in \mathcal{Y}$ for a given input $x$ to get the prediction:

\[
h_w(x) = \arg\max_{y \in \mathcal{Y}} f_w(x, y)
\]

The discriminant function $f_w(x, y)$ takes the form of a linear function:

\[
f_w(x, y) = w^T \Psi(x, y)
\]

where $w \in \mathbb{R}^N$ is a parameter vector and $\Psi(x, y)$ is a feature vector relating an input $x$ and output $y$. The features need to be designed for a given problem so that they capture the dependency structure of $y$ and $x$ and the relations among the outputs $y$. Then, the goal is to find a weight vector $w$ that maximizes the margin:

\[
\gamma(x_i, y_i; w) = w^T \Psi(x_i, y_i) - \max_{y'_i \in \mathcal{Y} \setminus y_i} w^T \Psi(x_i, y'_i)
\]

The max-margin problem above can be formulated as an optimization problem called structural SVM (Tsochantaridis, Joachims, Hofmann, & Altun, 2004; Tsochantaridis et al., 2005) as follows:

Optimization Problem 1 (OP1): Structural SVM

\[
\min_{w, \xi \ge 0} \; \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i
\]
\[
\text{s.t. } \forall i, \forall y \in \mathcal{Y} \setminus y_i : \; w^T [\Psi(x_i, y_i) - \Psi(x_i, y)] \ge 1 - \xi_i
\]


The slack variables $\xi_i$ are used to allow some errors in the training data, and the scalar $C > 0$ is a hyperparameter that controls the trade-off between minimizing the training error and maximizing the margin. This formulation implicitly imposes a zero-one loss on each constraint, which is inappropriate for most kinds of structured output since it treats a prediction that is very close to the correct one the same as a prediction that is completely different from the right one. To address this problem, Taskar, Guestrin, and Koller (2003) proposed to re-scale the margin by the Hamming loss of the wrong label. This margin-rescaling approach also works for other loss functions (Tsochantaridis et al., 2005). The resulting optimization problem is as follows:


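Optimization Problem 2 (OP2): Structural SVM with Margin-Rescaling

\[
\min_{w, \xi \ge 0} \; \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \xi_i
\]
\[
\text{s.t. } \forall i, \forall y \in \mathcal{Y} : \; w^T [\Psi(x_i, y_i) - \Psi(x_i, y)] \ge \Delta(y_i, y) - \xi_i
\]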


Note that OP1 is OP2 with zero-one loss. Recently, Joachims et al. (2009) proposed a reformulation of the above optimization, called "1-slack" structural SVMs, which combines all training examples into one big training example and has only one slack variable for the new mega example:

Optimization Problem 3 (OP3): 1-Slack Structural SVM with Margin-Rescaling

\[
\min_{w, \xi \ge 0} \; \frac{1}{2} w^T w + C \xi
\]
\[
\text{s.t. } \forall (\hat{y}_1, \ldots, \hat{y}_n) \in \mathcal{Y}^n : \; \frac{1}{n} w^T \sum_{i=1}^{n} [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi
\]


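Algorithm 1 Cutting-plane method for solving the "1-slack" structural SVM (Joachims et al., 2009)

1: Input: $S = ((x_1, y_1), \ldots, (x_n, y_n))$, $C$, $\epsilon$
2: $\mathcal{W} \leftarrow \emptyset$
3: repeat
4:     $(w, \xi) \leftarrow \arg\min_{w, \xi \ge 0} \frac{1}{2} w^T w + C \xi$
           s.t. $\forall (\hat{y}_1, \ldots, \hat{y}_n) \in \mathcal{W} : \frac{1}{n} w^T \sum_{i=1}^{n} [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \xi$
5:     for $i = 1$ to $n$ do
6:         $\hat{y}_i \leftarrow \arg\max_{\hat{y} \in \mathcal{Y}} \{ \Delta(y_i, \hat{y}) + w^T \Psi(x_i, \hat{y}) \}$
7:     end for
8:     $\mathcal{W} \leftarrow \mathcal{W} \cup \{ (\hat{y}_1, \ldots, \hat{y}_n) \}$
9: until $\frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \frac{1}{n} w^T \sum_{i=1}^{n} [\Psi(x_i, y_i) - \Psi(x_i, \hat{y}_i)] \le \xi + \epsilon$
10: return $(w, \xi)$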


The 1-slack reformulation leads to a faster and more scalable training algorithm whose running time is provably linear in the number of training examples (Joachims et al., 2009).

In each iteration, Algorithm 1 solves a Quadratic Programming (QP) problem (line 4) to find the optimal weights corresponding to the current set of constraints $\mathcal{W}$, and calls a separation oracle (line 6), also called a loss-augmented inference problem (Taskar, Chatalbashev, Koller, & Guestrin, 2005), to find the most violated constraint to add to $\mathcal{W}$. The QP problem in line 4 can be solved by any general QP solver. In contrast, for each representation (such as Markov networks or weighted context free grammars) a specific algorithm is needed for solving the loss-augmented inference problem.

To enforce a sparse solution on the learned weights, we can replace the squared 2-norm, $w^T w$, in these formulations by the 1-norm, $\|w\|_1 = \sum_{i=1}^{N} |w_i|$, like previous work on 1-norm SVMs (Bradley & Mangasarian, 1998; Zhu, Rosset, Hastie, & Tibshirani, 2003) for binary classification. Using the substitution $w_i = w_i^+ - w_i^-$ and $|w_i| = w_i^+ + w_i^-$ with $w_i^+, w_i^- \ge 0$ (Fung & Mangasarian, 2004), we can cast the 1-norm minimization problem as a Linear Programming (LP) problem and use Algorithm 1 to solve it by replacing the QP problem in line 4 with the transformed LP problem. A special case of the 1-norm structural SVM for the case of Markov Networks is presented in Zhu and Xing (2009).
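The variable split used in this substitution can be illustrated with a few lines of Python. This is only a sketch of the change of variables, not an LP solver; the function names are hypothetical. Note that $|w_i| = w_i^+ + w_i^-$ holds when at most one of $w_i^+$ and $w_i^-$ is non-zero, which is the case for the canonical split below and, under the usual argument, at an optimum of the LP since the objective penalizes both parts.

```python
def split_weights(w):
    """Canonical split w_i = w_i^+ - w_i^- with both parts non-negative."""
    w_plus = [max(wi, 0.0) for wi in w]
    w_minus = [max(-wi, 0.0) for wi in w]
    return w_plus, w_minus

def l1_norm_from_split(w_plus, w_minus):
    """The 1-norm becomes a linear function of the split variables."""
    return sum(p + m for p, m in zip(w_plus, w_minus))

w = [1.5, -2.0, 0.0]
w_plus, w_minus = split_weights(w)
recombined = [p - m for p, m in zip(w_plus, w_minus)]
norm = l1_norm_from_split(w_plus, w_minus)
```

Because both the objective and the constraints are linear in $(w^+, w^-, \xi)$ after this substitution, the problem in line 4 can be handed to a general LP solver.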

In summary, to apply structural SVMs to a new problem, one needs to choose a representation for the model, design a corresponding feature vector function $\Psi(x, y)$, select a loss function $\Delta(y, \hat{y})$, and design algorithms to solve the two argmax problems:

Prediction: $\arg\max_{y \in \mathcal{Y}} w^T \Psi(x, y)$

Separation Oracle: $\arg\max_{\hat{y} \in \mathcal{Y}} \{ \Delta(y, \hat{y}) + w^T \Psi(x, \hat{y}) \}$
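As a minimal sketch of these two argmax problems, the Python fragment below solves both by enumerating a tiny space of binary label vectors, with the Hamming loss as $\Delta$. The feature function `psi`, the weights, and the input are hypothetical; a real application needs a problem-specific inference algorithm because the output space is exponentially large.

```python
from itertools import product

def hamming(y_true, y_pred):
    """Delta(y, y_hat): number of positions where the two labelings differ."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def psi(x, y):
    """Hypothetical joint features: input/label agreement and label smoothness."""
    unary = sum(xj * yj for xj, yj in zip(x, y))
    pairwise = sum(y[j] == y[j + 1] for j in range(len(y) - 1))
    return [unary, pairwise]

def predict(w, x, n_labels):
    """argmax_y w^T Psi(x, y) by exhaustive enumeration."""
    return max(product([0, 1], repeat=n_labels),
               key=lambda y: dot(w, psi(x, y)))

def separation_oracle(w, x, y_true, n_labels, loss=hamming):
    """argmax_y_hat { Delta(y, y_hat) + w^T Psi(x, y_hat) } by enumeration."""
    return max(product([0, 1], repeat=n_labels),
               key=lambda y: loss(y_true, y) + dot(w, psi(x, y)))

w = [1.0, 0.2]
x = [1.0, -0.5, 2.0]
y_true = (1, 0, 1)
prediction = predict(w, x, 3)
most_violated = separation_oracle(w, x, y_true, 3)
```

In this example the prediction already equals the true labeling, yet the oracle returns a different labeling: its score is close to that of the true labeling and it incurs a Hamming loss of 1, so it is the labeling whose margin constraint is most violated.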


3  Discriminative structure and weight learning for MLNs with non-recursive clauses

In this section, we look at a special class of MLNs where all the clauses are non-recursive clauses which contain only one non-evidence literal. We present a new procedure for discriminatively learning both the structure and parameters for this type of MLN. The proposed approach is a two-step process. The first step uses an off-the-shelf Inductive Logic Programming (ILP) system, Aleph (Srinivasan, 2001), to generate a large set of potentially good clauses. The second step learns the weights for these clauses, preferring to eliminate useless clauses by giving them zero weight. The weight learner in the second step tries to find a small set of non-zero weights which maximizes the conditional likelihood of the data subject to L1-regularization. We first discuss the proposed approach in detail and then the experimental evaluation.

3.1  Discriminative  Structure  Learning 

Ideally, the search for discriminative MLN clauses would be directly guided by the goal of maximizing their contribution to the predictive accuracy of a complete MLN. However, this would require evaluating every proposed refinement to the existing set of learned clauses by relearning weights for all of the clauses and performing full probabilistic inference to determine the score of the revised model. This process is computationally expensive and would have to be repeated for each of the combinatorially large number of potential clause refinements. Evaluating clauses in standard ILP is quicker since each clause can be evaluated in isolation based on the accuracy of its logical inferences about the target predicate. Consequently, we take the heuristic approach of using a standard ILP method to generate clauses; however, since the logical accuracy of a clause is only a rough approximation of its value in a final MLN, we generate a large number of candidates whose accuracy is at least markedly greater than random guessing and allow subsequent weight learning to determine their value to an overall MLN.

In order to find a set of potentially good clauses for an MLN, we use a particular configuration of Aleph.
Specifically, we use the induce_cover command and the m-estimate evaluation function. The induce_cover
command implements a variant of Progol’s MDIE greedy covering algorithm (Muggleton, 1995) which
does not remove previously covered examples when scoring a new clause. The normal Aleph induce
command scores a clause based only on its coverage of currently uncovered positive examples. However,
this scoring is not reflective of its use in a final MLN, and we found that the induce_cover approach produces
a larger set of more useful clauses that significantly increases the accuracy of our final learned MLN. The
m-estimate (Dzeroski, 1991) is a Bayesian estimate of the accuracy of a clause (Cussens, 2007). The m
parameter defining the underlying prior distribution is automatically set to the maximum likelihood estimate
of its best value. The output of induce_cover is a theory, a set of high-scoring clauses that cover all the
positive examples. However, these clauses were selected based on an m-estimate of their accuracy under
a purely logical interpretation, and may not be the best ones for an MLN. Therefore, in addition to these
clauses, we also save all generated clauses whose m-estimate is greater than a predefined threshold (set to
0.6 in our experiments). This provides a large set of clauses of potential utility for an MLN. We use the
name Aleph++ to refer to this version of Aleph.
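
As a rough illustration of the clause-saving rule, the sketch below scores candidate clauses with the standard m-estimate of accuracy, (p + m * prior) / (p + n + m), and keeps those above the 0.6 threshold. The function names, the fixed value of m, and the coverage counts are assumptions made for the example; Aleph chooses m automatically, which this sketch does not attempt to reproduce.

```python
def m_estimate(pos_covered, neg_covered, prior_pos, m):
    """Standard m-estimate of clause accuracy: (p + m * prior) / (p + n + m)."""
    return (pos_covered + m * prior_pos) / (pos_covered + neg_covered + m)


def keep_clauses(candidates, prior_pos, m=2.0, threshold=0.6):
    """Return the names of candidate clauses whose m-estimate exceeds the threshold.

    `candidates` maps a (hypothetical) clause name to (positives covered, negatives covered).
    """
    return [name for name, (p, n) in candidates.items()
            if m_estimate(p, n, prior_pos, m) > threshold]


# Toy coverage counts, invented for illustration only.
candidates = {"c1": (40, 5), "c2": (10, 10), "c3": (3, 0)}
kept = keep_clauses(candidates, prior_pos=0.5)
```

With these toy counts, the clause that covers as many negatives as positives is dropped, while a clause with small but clean coverage is retained for the weight learner to judge.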

3.2  Discriminative  Weight  Learning 

Compared to Alchemy’s current best discriminative weight learning method (Lowd & Domingos, 2007),
our method embodies two important modifications: exact inference and L1-regularization. This section
describes these two modifications.


First, given the restricted nature of the clauses constructed by Aleph, we can use an efficient exact
probabilistic inference method when learning the weights instead of the approximate inference algorithm
that is used to handle the general case. Since these clauses are non-recursive definite clauses in which the
target predicate only appears once, a grounding of any clause will contain only one grounding of the target
predicate. For MLNs, this means that the Markov blanket of a query atom only contains evidence atoms.
Consequently, the query atoms are independent given the evidence. Let Y be the set of query atoms and X
be the set of evidence atoms; the conditional log-likelihood of Y given X in this case is:

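$$\begin{aligned}
\log P(Y = y \mid X = x) &= \log \prod_{j=1}^{n} P(Y_j = y_j \mid X = x) \\
&= \sum_{j=1}^{n} \log P(Y_j = y_j \mid X = x)
\end{aligned}$$

and,

$$P(Y_j = y_j \mid X = x) = \frac{\exp\left(\sum_{i \in C_{Y_j}} w_i \, n_i(x, y_{[Y_j = y_j]})\right)}{\exp\left(\sum_{i \in C_{Y_j}} w_i \, n_i(x, y_{[Y_j = 0]})\right) + \exp\left(\sum_{i \in C_{Y_j}} w_i \, n_i(x, y_{[Y_j = 1]})\right)} \qquad (2)$$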

where $C_{Y_j}$ is the set of all MLN clauses with at least one grounding containing the query atom $Y_j$,
$n_i(x, y_{[Y_j = y_j]})$ is the number of groundings of the $i$th clause that evaluate to true when all the evidence atoms in
$X$ and the query atom $Y_j$ are set to their truth values, and similarly for $n_i(x, y_{[Y_j = 0]})$ and $n_i(x, y_{[Y_j = 1]})$ when $Y_j$
is set to 0 and 1 respectively. Then the gradient of the CLL is:

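$$\frac{\partial}{\partial w_i} \log P(Y = y \mid X = x) = \sum_{j=1}^{n} \left[ n_i(x, y_{[Y_j = y_j]}) - P(Y_j = 0 \mid X = x) \, n_i(x, y_{[Y_j = 0]}) - P(Y_j = 1 \mid X = x) \, n_i(x, y_{[Y_j = 1]}) \right]$$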

Notice that the sum of the last two terms in the gradient is the expected count of the number of true groundings
of the $i$th formula. In general, computing this expected count requires performing approximate inference
under the model. For example, Singla and Domingos (2005) ran MPE inference
and used the counts in the MPE state to approximate the expected counts. However, in our case, using the
standard closed world assumption for evidence predicates, all the $n_i$’s can be computed without approximate
inference since there is no ground atom whose truth value is unknown. This is a result of restricting the
structure learner to non-recursive definite clauses. In fact, this result still holds even when the clauses are
not Horn clauses. The only restriction is that the target predicate appears only once in every clause. Note
that given a set of weights, computing the conditional probability $P(y \mid x)$, the CLL, and its gradient requires
only the $n_i$ counts. So, in our case, the conditional probability $P(Y_j = y_j \mid X = x)$, the CLL, and its gradient
can be computed exactly. In addition, these counts only need to be computed once, and Alchemy provides
an efficient method for computing them. Alchemy also provides an efficient way to construct the Markov
blanket of a query atom; in particular, it ignores all ground formulae whose truth values are unaffected by
the value of the query atom. In our case, this helps reduce the size of the Markov blanket of a query atom
significantly since many ground clauses are satisfied by the evidence. As a result, our exact inference is very
fast even when the MLN contains thousands of clauses.
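
The exact computation described above can be sketched in a few lines, assuming the per-atom clause counts have already been obtained (here they are simply passed in as lists). The function name, argument layout, and toy counts are illustrative assumptions and are not Alchemy's interface; the sketch sums over all clauses for every atom, which gives the same result as restricting to the clauses that mention the atom because the remaining counts cancel.

```python
import math


def cll_and_gradient(weights, counts0, counts1, labels):
    """Exact CLL and gradient when query atoms are independent given the evidence.

    counts0[j][i] / counts1[j][i]: true groundings of clause i when query atom j
    is set to 0 / 1 (evidence fixed). labels[j] is the observed value of atom j.
    """
    cll = 0.0
    grad = [0.0] * len(weights)
    for n0, n1, y in zip(counts0, counts1, labels):
        s0 = sum(w * c for w, c in zip(weights, n0))
        s1 = sum(w * c for w, c in zip(weights, n1))
        # log(exp(s0) + exp(s1)), computed in a numerically stable way
        top = max(s0, s1)
        log_z = top + math.log(math.exp(s0 - top) + math.exp(s1 - top))
        p0, p1 = math.exp(s0 - log_z), math.exp(s1 - log_z)
        cll += (s1 if y == 1 else s0) - log_z
        n_obs = n1 if y == 1 else n0
        for i in range(len(weights)):
            # observed count minus expected count under the model
            grad[i] += n_obs[i] - p0 * n0[i] - p1 * n1[i]
    return cll, grad


# Toy data: two query atoms, two clauses (counts invented for illustration).
weights = [0.5, -1.0]
counts0 = [[1, 2], [0, 1]]
counts1 = [[3, 0], [2, 2]]
labels = [1, 0]
cll, grad = cll_and_gradient(weights, counts0, counts1, labels)
```

Because the counts are fixed, every evaluation of the objective inside an optimizer reduces to the arithmetic above, with no sampling or search.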


Given a procedure for computing the CLL and its gradient, standard gradient-based optimization methods
can be used to find a set of weights that optimizes the CLL. However, to prevent overfitting and select
only the best clauses, we follow the approach suggested by Lee, Ganapathi, and Koller (2007) and introduce
a Laplacian prior with zero mean, $P(w_i) = (\beta/2) \cdot \exp(-\beta |w_i|)$, on each weight, and then optimize the
posterior conditional log-likelihood instead of the CLL. The final objective function is:

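$$\begin{aligned}
\log P(Y \mid X) P(w) &= \log P(Y \mid X) + \log P(w) \\
&= \log P(Y \mid X) + \log\Big(\prod_i P(w_i)\Big) \\
&= \mathrm{CLL} + \sum_i \log\Big(\frac{\beta}{2} \cdot \exp(-\beta |w_i|)\Big) \\
&= \mathrm{CLL} - \beta \sum_i |w_i| + \text{constant}
\end{aligned}$$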

There is now an additional term $\beta \sum_i |w_i|$ in the objective function, which penalizes each non-zero weight
$w_i$ by $\beta |w_i|$. So, the larger $\beta$ is (corresponding to a smaller variance of the prior distribution), the more
we penalize non-zero weights. Therefore, placing a Laplacian prior with zero mean on each weight is
equivalent to performing an L1-regularization of the parameters. An important property of L1-regularization
is its tendency to force parameters to zero by strongly penalizing small terms (Lee et al., 2007). In order to
learn weights that optimize the L1-regularized CLL, we use the OWL-QN package which implements the
Orthant-Wise Limited-memory Quasi-Newton algorithm (Andrew & Gao, 2007).
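
To make the effect of the penalty concrete, the sketch below evaluates the regularized objective and applies a soft-thresholding step, which is the proximal operator of the L1 penalty. Soft-thresholding is used here only because it shows in a few lines how an L1 penalty sets small weights exactly to zero; it is not the OWL-QN algorithm used in this work, and the function names and numbers are illustrative assumptions.

```python
def l1_objective(cll, weights, beta):
    """L1-regularized objective: CLL - beta * sum_i |w_i| (constant term dropped)."""
    return cll - beta * sum(abs(w) for w in weights)


def soft_threshold(weights, tau):
    """Shrink each weight toward zero by tau; weights within [-tau, tau] become exactly 0."""
    shrunk = []
    for w in weights:
        if w > tau:
            shrunk.append(w - tau)
        elif w < -tau:
            shrunk.append(w + tau)
        else:
            shrunk.append(0.0)
    return shrunk


weights = [1.5, -0.2, 0.05, -2.0]
sparse = soft_threshold(weights, tau=0.25)   # the two small weights are zeroed out
n_nonzero = sum(1 for w in sparse if w != 0.0)
```

A quadratic (L2) penalty would instead rescale every weight by a common factor, leaving small weights small but non-zero, which is why it does not prune clauses.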

This approach to preventing overfitting contrasts with the standard L2-regularization used in previous
work on learning weights for MLNs, which is equivalent to assuming a Gaussian prior with zero mean on
each weight and does not penalize non-zero weights as severely. Since Aleph++ generates a very large
number of potential clauses, L1-regularization encourages eliminating the less useful ones by setting their
weights to zero. In agreement with prior results on L1-regularization (Ng, 2004; Dudik, Phillips, & Schapire,
2007), our experiments confirm that it results in simpler and more accurate learned models compared to
L2-regularization.

3.3  Experimental  Evaluation 

In  this  section,  we  present  experiments  that  were  designed  to  answer  the  following  questions: 

1.  How  does  our  method  compare  to  existing  methods,  specifically: 

(a)  Extant  discriminative  learning  for  MLNs,  viz.  Alchemy. 

(b)  Traditional  ILP  methods,  viz.  Aleph. 

(c) “Advanced” ILP methods, viz. kFOIL (Landwehr, Passerini, Raedt, & Frasconi, 2006), tFOIL
(Landwehr, Kersting, & Raedt, 2007), and Rumble (Ruckert & Kramer, 2007).

2. How does each of our system’s major novel components below contribute to its performance:

(a) Generation of a larger set of potential clauses by using Aleph++ instead of Aleph.

(b) Exact MLN inference for non-recursive definite clauses instead of general approximate inference.

(c) L1-regularization instead of L2.


Table 1: Some background evidence and examples from the Alzheimer toxic dataset.

Background evidence:
    r_subst_1(A1,H), r_subst_1(B1,H), r_subst_1(D1,H), x_subst(B1,7,CL), x_subst(D1,6,OCH3), polar(CL,POLAR3),
    polar(OCH3,POLAR2), great_polar(POLAR3,POLAR2), size(CL,SIZE1), size(OCH3,SIZE2), alk_groups(A1,0),
    alk_groups(B1,0), alk_groups(D1,0), great_size(SIZE2,SIZE1), flex(CL,FLEX0), flex(OCH3,FLEX1)

Examples:
    less_toxic(B1,A1), less_toxic(A1,D1), less_toxic(B1,D1)

3.3.1  Data 

We employed four benchmark data sets previously used to evaluate a variety of ILP and relational learning
algorithms. They concern predicting the relative biochemical activity of variants of Tacrine, a drug for
Alzheimer’s disease (King et al., 1995).^ The data contain background knowledge about the physical and
chemical properties of substituents such as their hydrophobicity and polarity, the relations between various
physical and chemical constants, and other relevant information. The goal is to compare various drugs on
four important biochemical properties: low toxicity, high acetyl cholinesterase inhibition, good reversal
of scopolamine-induced memory impairment, and inhibition of amine re-uptake. For each property, the
positive and negative examples are pairwise comparisons of drugs. For example, less_toxic(d1, d2) means
that drug d1’s toxicity is less than d2’s. These ordering relations are transitive but not complete (i.e. for
some pairs of drugs it is unknown which one is better). Therefore, this is a structured (a.k.a. collective)
prediction problem since the output labels should form a partial order. However, previous work has ignored
this structure and just predicted the examples separately as distinct binary classification problems. In this
work, in addition to treating the problem as independent classification, we also use an MLN to perform
structured prediction by explicitly imposing the transitive constraint on the target predicate. Table 1 shows
some background facts and examples from one of the datasets, and Table 2 summarizes information about
all four datasets.


Table 2: Summary statistics for Alzheimer’s data sets.

Data set            # Examples   % Positive   # Predicates
Alzheimer acetyl    1326         50%          30
Alzheimer amine     686          50%          30
Alzheimer memory    642          50%          30
Alzheimer toxic     886          50%          30

3.3.2  Methodology 

To  answer  the  above  questions,  we  ran  experiments  with  the  following  systems: 

Alchemy: Uses Alchemy’s structure learner (Kok & Domingos, 2005) and the most accurate existing
discriminative weight learner, PSCG (Lowd & Domingos, 2007), with the “ne” (non-evidence)
parameter set to the target predicate.

^  Since  the  current  ALCHEMY  does  not  support  real  valued  variables,  we  could  not  test  our  approach  on  the  other  standard  ILP 
benchmark  data  sets  in  molecular  biology. 


Busl:  Uses  BUSL  (Mihalkova  &  Mooney,  2007)  and  PSCG  discriminative  weight  learning  with  the  “ne” 
(non-evidence)  parameter  set  to  the  target  predicate. 

Aleph:  Uses  Aleph’s  standard  settings  with  a  few  modifications.  The  maximum  number  of  literals  in  an 
acceptable  clause  was  set  to  5.  The  minimum  number  of  positive  examples  covered  by  an  acceptable 
clause  was  set  to  2.  The  upper  bound  on  the  number  of  negative  examples  covered  by  an  acceptable 
clause was set to 300. The evaluation function was set to auto_m, and the minimum score of an
acceptable  clause  was  set  to  0.6.  The  induce_cover  command  was  used  to  learn  the  clauses.  We 
found  that  this  configuration  gave  somewhat  better  overall  accuracy  compared  to  those  reported  in 
previous  work. 

AlephPSCG: Uses the discriminative weight learner PSCG to learn MLN weights for the clauses in the
final theory returned by Aleph. Note that PSCG also uses L2-regularization.

AlephExactL2: Uses the limited-memory BFGS algorithm (Liu & Nocedal, 1989) implemented in
Alchemy to learn discriminative MLN weights for the clauses in the final theory returned by Aleph.
The objective function is CLL with L2-regularization. The CLL is computed exactly as described in
Section 3.2.

Aleph++PSCG: Like AlephPSCG, but learns weights for the larger set of clauses returned by Aleph++.

Aleph++ExactL2: Like AlephExactL2, but learns weights for the larger set of clauses returned by
Aleph++.

Aleph++ExactL1: Our full proposed approach using exact inference and L1-regularization to learn
weights on the clauses returned by Aleph++.

To force the predictions for the target predicate to properly constitute a partial ordering, we also tried
adding to the learned MLNs a hard constraint (i.e. a clause with infinite weight) stating the transitive
property of the target predicate, and used the MC-SAT algorithm to perform prediction on the test data. This
exploits the ability of MLNs to perform collective classification (structured prediction) for the complete set
of test examples.

In testing, only the background facts are provided as evidence to ensure that all predictions are based on
the chemical structure of a drug. For all systems except Aleph, a threshold of 0.5 was used to convert
predicted probabilities into boolean values. The predictive accuracy of these algorithms for the target predicate
was compared using 10-fold cross-validation. The significance of the results was evaluated using a
two-tailed paired t-test with a 95% confidence level. To compare the quality of the predicted probabilities,
we also report the average area under the ROC curve (AUC-ROC) for all probabilistic systems by using the
AUCCalculator package (Davis & Goadrich, 2006).
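
The significance test can be sketched as follows for 10-fold cross-validation, where the paired t statistic has 9 degrees of freedom and the two-tailed critical value at the 95% level is 2.262. The function name and the per-fold accuracies are hypothetical and are not results from this work.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """t statistic of a paired t-test on per-fold scores (len(scores_a) - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)   # sample variance of the differences
    return mean / math.sqrt(var / k)


# Two-tailed critical value of Student's t for 9 degrees of freedom at the 95% level.
T_CRIT_DF9 = 2.262

# Hypothetical per-fold accuracies for two systems over the same 10 folds.
sys_a = [0.90, 0.88, 0.91, 0.87, 0.92, 0.89, 0.90, 0.93, 0.88, 0.91]
sys_b = [0.85, 0.86, 0.88, 0.84, 0.90, 0.85, 0.87, 0.89, 0.86, 0.88]
t = paired_t_statistic(sys_a, sys_b)
significant = abs(t) > T_CRIT_DF9
```

Pairing by fold matters here: both systems are evaluated on the same test splits, so the test is applied to the per-fold differences rather than to the two sets of scores independently.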

3.3.3  Results  and  Discussion 

Tables 3 and 4 show the average accuracy and AUC-ROC with standard deviation for each system running
on each data set. Our complete system (Aleph++ExactL1) achieves significantly higher accuracy than
both Alchemy and Busl on all 4 data sets and significantly higher than Aleph on all except the memory
data set, answering questions 1(a) and 1(b). In turn, Aleph has been shown to give higher accuracy on
these data sets than other standard ILP systems like FOIL (Landwehr et al., 2007). Alchemy’s existing
non-discriminative structure learners find only a few (3-5) simple clauses. Two of them are unit clauses
for the target predicate, such as great_ne(a1,a1) and great_ne(a1,a2); the others capture the transitive nature
of the target relation. Therefore, even after they are discriminatively weighted, their predictions are not
significantly better than random guessing.

Table 3: Average predictive accuracies and standard deviations for all systems. Bold numbers indicate the
best result on a data set.

Data set           Alchemy     BUSL        Aleph       Aleph       Aleph       Aleph++     Aleph++     Aleph++
                                                       PSCG        ExactL2     PSCG        ExactL2     ExactL1
Alzheimer amine    50.1 ± 0.5  51.3 ± 2.5  81.6 ± 5.1  64.6 ± 4.6  83.5 ± 4.7  72.0 ± 5.2  86.8 ± 4.4  89.4 ± 2.7
Alzheimer toxic    54.7 ± 7.4  51.7 ± 5.3  81.7 ± 4.2  74.7 ± 1.9  87.5 ± 4.8  69.9 ± 1.2  89.5 ± 3.0  91.3 ± 2.8
Alzheimer acetyl   48.2 ± 2.9  55.9 ± 8.7  79.6 ± 2.2  78.0 ± 3.2  79.5 ± 2.0  76.5 ± 3.7  82.1 ± 2.1  85.1 ± 2.4
Alzheimer memory   50 ± 0.0    49.8 ± 1.6  76.0 ± 4.9  60.3 ± 2.1  72.6 ± 3.4  65.6 ± 5.4  72.9 ± 5.2  77.6 ± 4.9

Table 4: Average AUC-ROC and standard deviations for all systems. Bold numbers indicate the best result
on a data set.

Data set           Alchemy      BUSL         Aleph        Aleph        Aleph++      Aleph++      Aleph++
                                             PSCG         ExactL2      PSCG         ExactL2      ExactL1
Alzheimer amine    .483 ± .115  .641 ± .110  .846 ± .041  .904 ± .027  .777 ± .052  .935 ± .032  .954 ± .019
Alzheimer toxic    .622 ± .079  .511 ± .079  .904 ± .034  .930 ± .035  .874 ± .041  .937 ± .029  .939 ± .035
Alzheimer acetyl   .473 ± .037  .588 ± .108  .850 ± .018  .850 ± .020  .810 ± .040  .899 ± .015  .916 ± .013
Alzheimer memory   .452 ± .088  .426 ± .065  .744 ± .040  .768 ± .032  .737 ± .059  .813 ± .059  .844 ± .052

The ablations that remove components from our overall system demonstrate the important contribution
of each component. Regarding question 2(b), the systems using general approximate inference
(AlephPSCG and Aleph++PSCG) perform much worse than the corresponding versions that use exact
inference (AlephExactL2 and Aleph++ExactL2). Therefore, when there is a target predicate that
can be accurately inferred using non-recursive definite clauses, exploiting this restriction to perform exact
inference is a clear win.

Regarding question 2(a), Aleph++ExactL2 performs significantly better than AlephExactL2, demonstrating
the advantage of learning a large set of potential clauses and combining them with learned weights
in an overall MLN. Across the four datasets, Aleph++ returns an average of 6,070 clauses compared to
only 10 for Aleph.
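
The 6,070 figure is consistent with the Aleph++ column of Table 6, whose four per-dataset averages (7061, 2034, 8662, and 6524 clauses) give:

$$\frac{7061 + 2034 + 8662 + 6524}{4} = \frac{24281}{4} \approx 6070$$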

Table 5 presents average accuracies with standard deviations for the MLN systems when we include
a transitivity clause for the target predicate. This constraint improves the accuracies of AlephExactL2,
Aleph++ExactL2, and Aleph++ExactL1, but sometimes decreases the accuracy of other systems, such
as AlephPSCG. This can be explained as follows. Since most of the predictions of Aleph++ExactL1 are
correct, enforcing transitivity can correct some of the wrong ones. However, AlephPSCG produces many
wrong predictions, so forcing them to obey transitivity can produce additional incorrect predictions.

Regarding question 2(c), using L1-regularization gives significantly higher accuracy and AUC-ROC than
using standard L2-regularization. This comparison was only performed for Aleph++ since this is when the
weight learner must choose from a large set of candidate clauses by encouraging zero weights. Table 6
compares the average number of clauses learned (after zero-weight clauses are removed) for L1- and
L2-regularization. As expected, the final learned MLNs are much simpler when using L1-regularization. On
average, L1-regularization reduces the size of the final set of clauses by 26% compared to L2-regularization.

Table 5: Average predictive accuracies and standard deviations for MLN systems with transitive clause
added.

Data set           Alchemy     BUSL        Aleph       Aleph       Aleph++     Aleph++     Aleph++
                                           PSCG        ExactL2     PSCG        ExactL2     ExactL1
Alzheimer amine    50.0 ± 0.0  52.2 ± 5.3  61.4 ± 3.6  87.0 ± 3.3  72.9 ± 3.5  91.7 ± 3.5  90.5 ± 3.6
Alzheimer toxic    50.0 ± 0.0  50.1 ± 0.8  73.3 ± 1.8  88.8 ± 4.8  68.4 ± 1.5  91.4 ± 3.6  91.9 ± 4.1
Alzheimer acetyl   53.0 ± 6.2  54.1 ± 4.9  80.4 ± 2.7  84.1 ± 3.1  83.3 ± 2.5  88.7 ± 2.1  87.6 ± 2.7
Alzheimer memory   50.0 ± 0.0  50.1 ± 0.5  58.9 ± 2.3  76.5 ± 3.5  70.1 ± 5.2  81.3 ± 4.8  81.3 ± 4.1

Table 6: Average number of clauses learned

Data set           Aleph++   Aleph++ExactL2   Aleph++ExactL1
Alzheimer amine    7061      5070             3477
Alzheimer toxic    2034      1194             747
Alzheimer acetyl   8662      5427             2433
Alzheimer memory   6524      4250             2471

Regarding question 1(c), several researchers have tested “advanced” ILP systems on our datasets. Table 7
compares our best results to those reported for tFOIL (a combination of FOIL and tree augmented
naive Bayes), kFOIL (a kernelized version of FOIL), and Rumble (a max-margin approach to learning a
weighted rule set). Our results are competitive with these recent systems. Additionally, unlike MLNs, these
methods do not create “declarative” theories that have a well-defined possible worlds semantics.

3.4  Related  Work 

Using an off-the-shelf ILP system to learn clauses for MLNs is not a new idea. Richardson and Domingos
(2006) used Claudien, a non-discriminative ILP system that can learn arbitrary first-order clauses, to
learn MLN structure and to refine the clauses from a knowledge base. Kok and Domingos (2005) reported
experimental results comparing their MLN structure learner to learning clauses using Claudien, FOIL,
and Aleph. However, since this previous work used the relatively small set of clauses produced by these
unaltered ILP systems, the performance was not very good. ILP systems have also been used to learn
structures for other SRL models. The Sayu system (Davis, Burnside, de Castro Dutra, Page, & Costa,
2005) used Aleph to propose candidate features for a Bayesian network classifier. Muggleton
(2000) used Progol, another popular ILP system, to learn clauses for Stochastic Logic Programs (SLPs).

When restricted to learning non-recursive clauses for classification, our approach is equivalent to using
Aleph to construct features for use by L1-regularized logistic regression. Under this view, our approach is
closely related to Maccent (Dehaspe, 1997), which uses a greedy approach to induce clausal constraints
that are used as features for maximum-entropy classification. One difference between our approach and
Maccent is that we use a two-step process instead of greedily adding one feature at a time. In addition, our
clauses are induced in a bottom-up manner while Maccent uses top-down search; and our weight learner
employs L1-regularization which makes it less prone to overfitting. Unfortunately, we could not compare
experimentally to Maccent since “only an implementation of a propositional version of MACCENT is
available, which only handles data in attribute-value (vector) format” (Landwehr et al., 2007). Additionally,
MLNs are a more expressive formalism that also allows for structured prediction, as demonstrated by our
results that include a transitivity constraint on the target relation.

Table 7: Average predictive accuracies and standard deviations of our best results and other “advanced” ILP
systems.

Data set           Our best results   tFOIL        kFOIL        Rumble
Alzheimer amine    91.7 ± 3.5         87.5 ± 4.4   88.8 ± 5.0   91.1
Alzheimer toxic    91.9 ± 4.1         92.1 ± 2.6   89.3 ± 3.5   91.2
Alzheimer acetyl   88.7 ± 2.1         82.8 ± 3.8   87.8 ± 4.2   88.4
Alzheimer memory   81.3 ± 4.1         80.4 ± 5.3   80.2 ± 4.0   83.2

3.5  Summary 

We have found that existing methods for learning Markov Logic Networks perform very poorly when tested
on several benchmark ILP problems in drug design. We have presented a new approach to constructing
MLNs that discriminatively learns both their structure and parameters to optimize predictive accuracy for
a stated target predicate when given evidence specified with a defined set of background predicates. It
uses a variant of an existing ILP system (Aleph) to construct a large number of potential clauses and then
effectively learns their parameters by altering existing discriminative MLN weight-learning methods to
utilize exact inference and L1-regularization. Experimental results show that the resulting system outperforms
existing MLN and ILP methods and gives state-of-the-art results for the Alzheimer’s-drug benchmarks.

4  Max-Margin  Weight  Learning  for  MLNs 

In Section 3, we aim to learn a model that maximizes the CLL of the data. If the goal is to predict accurate
target-predicate probabilities, that approach is well motivated. However, in many applications, the actual
goal is to maximize an alternative performance metric such as classification accuracy or F-measure.
Max-margin methods are a competing approach to discriminative training that are well-founded in computational
learning theory and have demonstrated empirical success in many applications (Cristianini & Shawe-Taylor,
2000). They also have the advantage that they can be adapted to maximize a variety of performance metrics
in addition to classification accuracy (Joachims, 2005). In this section, we present a max-margin approach
to weight learning in MLNs based on the general framework for max-margin training of structured models
(Tsochantaridis et al., 2005; Joachims et al., 2009).

4.1 Max-Margin Formulation

All of the current discriminative weight learners for MLNs try to find a weight vector $w$ that optimizes
the conditional log-likelihood $P(y \mid x)$ of the query atoms $y$ given the evidence $x$. However, an alternative
approach is to learn a weight vector $w$ that maximizes the ratio:

$$\frac{P(y \mid x, w)}{P(\hat{y} \mid x, w)}$$

between the probability of the correct truth assignment $y$ and the closest competing incorrect truth assignment
$\hat{y} = \arg\max_{\bar{y} \in Y \setminus y} P(\bar{y} \mid x)$. Applying equation 1 and taking the log, this problem translates to maximizing
the margin:

$$\begin{aligned}
\gamma(x, y; w) &= w^T n(x, y) - w^T n(x, \hat{y}) \\
&= w^T n(x, y) - \max_{\bar{y} \in Y \setminus y} w^T n(x, \bar{y})
\end{aligned}$$
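
The step from the ratio to the margin can be spelled out as follows, assuming equation 1 (given earlier in the document and not reproduced here) has the usual log-linear form $P(y \mid x, w) = \exp(w^T n(x, y)) / Z_x$, where $Z_x$ is used here as a name for the normalization constant:

$$\begin{aligned}
\log \frac{P(y \mid x, w)}{P(\hat{y} \mid x, w)} &= \left[ w^T n(x, y) - \log Z_x \right] - \left[ w^T n(x, \hat{y}) - \log Z_x \right] \\
&= w^T n(x, y) - w^T n(x, \hat{y})
\end{aligned}$$

The normalization constant cancels, so the margin depends only on the clause counts of the two assignments.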

Note that this translation holds for all log-linear models. For example, if we apply it to a CRF (Lafferty,
McCallum, & Pereira, 2001) then the resulting model is an M3N (Taskar et al., 2003). In fact, this translation
is the connection between log-linear models and linear classifiers (Collins, 2004).

In turn, the max-margin problem above can be formulated as a “1-slack” structural SVM as described in
Section 2.3:


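Optimization Problem 4 (OP4): Max-Margin Markov Logic Networks

$$\begin{aligned}
\min_{w,\, \xi \ge 0} \quad & \frac{1}{2} w^T w + C \xi \\
\text{s.t.} \quad & \forall \bar{y} \in Y : \; w^T \left[ n(x, y) - n(x, \bar{y}) \right] \ge \Delta(y, \bar{y}) - \xi
\end{aligned}$$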

So for MLNs, the number of true groundings of the clauses $n(x, y)$ plays the role of the feature vector
function $\Psi(x, y)$ in the general structural SVM problem. In other words, each clause in an MLN can be
viewed as a feature representing a dependency between a subset of inputs and outputs or a relation among
several outputs.
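
As a small illustration of one component of the count vector, the sketch below counts the true groundings of a transitivity clause over a binary relation, which is an example of a clause that relates several outputs. The function name, the relation R, and the constants are hypothetical; a real MLN system computes such counts far more efficiently than by full enumeration.

```python
from itertools import product


def count_true_groundings_transitivity(constants, true_atoms):
    """Count groundings of  R(a,b) ^ R(b,c) => R(a,c)  that are true in a world.

    `true_atoms` is the set of pairs (a, b) for which R(a, b) holds; every other
    grounding of R is treated as false (closed world).
    """
    count = 0
    for a, b, c in product(constants, repeat=3):
        body = (a, b) in true_atoms and (b, c) in true_atoms
        head = (a, c) in true_atoms
        if (not body) or head:   # an implication is false only when the body holds and the head fails
            count += 1
    return count


constants = ["d1", "d2", "d3"]
closed = {("d1", "d2"), ("d2", "d3"), ("d1", "d3")}   # transitively closed
broken = {("d1", "d2"), ("d2", "d3")}                 # missing R(d1, d3)
n_closed = count_true_groundings_transitivity(constants, closed)
n_broken = count_true_groundings_transitivity(constants, broken)
```

The two worlds differ by a single violated grounding, so their counts for this clause differ by one, and the clause weight determines how much that difference changes the score.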

As mentioned, in order to apply Algorithm 1 to MLNs, we need algorithms for solving the following
two problems:

Prediction: $\arg\max_{y \in Y} w^T n(x, y)$

Separation Oracle: $\arg\max_{\bar{y} \in Y} \{ \Delta(y, \bar{y}) + w^T n(x, \bar{y}) \}$
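
For intuition, both problems can be solved by exhaustive enumeration when there are only a handful of query atoms. The sketch below does exactly that, using Hamming loss as an assumed choice of loss function; the function names and the toy count function are hypothetical, and exhaustive search is shown only because it is easy to check, not because it scales.

```python
from itertools import product


def score(weights, counts):
    """w^T n(x, y) for a vector of clause counts."""
    return sum(w * c for w, c in zip(weights, counts))


def hamming(y_true, y_other):
    """Number of query atoms on which two assignments disagree."""
    return sum(a != b for a, b in zip(y_true, y_other))


def predict(weights, count_fn, num_atoms):
    """Prediction: argmax over all assignments of w^T n(x, y), by exhaustive search."""
    return max(product((0, 1), repeat=num_atoms),
               key=lambda y: score(weights, count_fn(y)))


def separation_oracle(weights, count_fn, y_true):
    """Most violated assignment: argmax of loss(y_true, y) + w^T n(x, y)."""
    return max(product((0, 1), repeat=len(y_true)),
               key=lambda y: hamming(y_true, y) + score(weights, count_fn(y)))


def toy_counts(y):
    """Hypothetical counts for two clauses: atoms set to 1, and whether both atoms agree."""
    return [sum(y), 1 if y[0] == y[1] else 0]


weights = [0.4, 1.0]
y_pred = predict(weights, toy_counts, num_atoms=2)
y_viol = separation_oracle(weights, toy_counts, y_true=(1, 1))
```

The only difference between the two searches is the loss term, which is why an inference routine for prediction can often be adapted to serve as the separation oracle.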

The prediction problem is just the (intractable) MPE inference problem discussed in Section 2.1. We can use
MaxWalkSAT to get an approximate solution, but we have found that models trained with MaxWalkSAT
have very low predictive accuracy. On the other hand, recent work (Finley & Joachims, 2008) has found that
fully-connected pairwise Markov random fields, a special class of structural SVMs, trained with overgenerating
approximate inference methods (such as relaxation) preserve the theoretical guarantees of structural
SVMs trained with exact inference, and exhibit good empirical performance. Based on this result, we
sought a relaxation-based approximation for MPE inference. We first present an LP-relaxation algorithm
for MPE inference, then show how to modify it to solve the separation oracle problem for some specific loss
functions.

4.2  Approximate  MPE  inference  for  MLNs 

MPE inference in MLNs is a special case of MAP inference in Markov networks with binary variables,
and there has been a lot of work on approximation algorithms for solving MAP inference using convex
relaxation; see (Kumar, Kolmogorov, & Torr, 2009) for more details. However, these methods are not
suitable for MLNs. First, most of them are for Markov networks with unary and pairwise potential functions
while a ground MLN may contain many high-order cliques. The algorithms can be extended to handle high-order
potential functions (Werner, 2008), but they become computationally expensive. Second, they do not
handle deterministic factors, i.e. infinite potential functions. On the other hand, MPE inference in MLNs is
equivalent to the Weighted MAX-SAT problem, and there is also significant work on approximating this
NP-hard problem using LP-relaxation (Asano & Williamson, 2002; Asano, 2006). The existing algorithms
first relax and convert the Weighted MAX-SAT problem into a linear or semidefinite programming problem,
then solve it and apply a randomized rounding method to obtain an approximate integral solution. These
methods cannot be directly applied to MLNs, since they require the weights to be positive while MLN
weights can be negative or infinite. However, we can modify the conversion used in these approaches to
handle the case of negative and infinite weights.

Based on the evidence and the closed world assumption, a ground MLN contains only ground clauses
of the unknown ground atoms after removing all trivially satisfied and unsatisfied clauses. The following
procedure translates the MPE inference in a ground MLN into an Integer Linear Programming (ILP)
problem.

1. Assign a binary variable $y_i$ to each unknown ground atom. $y_i$ is 1 if the corresponding ground atom is
TRUE and 0 if the ground atom is FALSE.

2. For each ground clause $C_j$ with infinite weight, add the following linear constraint to the ILP problem:

$$\sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge 1$$

where $I_j^+$, $I_j^-$ are the sets of positive and negative ground literals in clause $C_j$ respectively.

3. For each ground clause $C_j$ with positive weight $w_j$, introduce a new auxiliary binary variable $z_j$, add
the term $w_j z_j$ to the objective function, and add the following linear constraint to the ILP problem:

$$\sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge z_j$$

$z_j$ is 1 if the corresponding ground clause is satisfied.

4. For each ground clause $C_j$ with $k$ ground literals and negative weight $w_j$, introduce a new auxiliary
boolean variable $z_j$, add the term $-w_j z_j$ to the objective function, and add the following $k$ linear
constraints to the ILP problem:

$$\begin{aligned}
1 - y_i &\ge z_j, \quad i \in I_j^+ \\
y_i &\ge z_j, \quad i \in I_j^-
\end{aligned}$$

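The final ILP has the following form:

Optimization Problem 5 (OP5):

\[
\begin{aligned}
\max_{y_i, z_j} \quad & \sum_{C_j \in C^+} w_j z_j + \sum_{C_j \in C^-} -w_j z_j \\
\text{s.t.} \quad & \sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge 1 && \forall\, C_j \text{ where } w_j = \infty \\
& \sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge z_j && \forall\, C_j \in C^+ \\
& 1 - y_i \ge z_j && \forall\, i \in I_j^+ \text{ and } C_j \in C^- \\
& y_i \ge z_j && \forall\, i \in I_j^- \text{ and } C_j \in C^- \\
& y_i, z_j \in \{0, 1\}
\end{aligned}
\]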


Algorithm 2 The modified ROUNDUP procedure
Input: The LP solution $y = \{y_1, \ldots, y_n\}$
$F \leftarrow \emptyset$
for $i = 1$ to $n$ do
    if $y_i$ is integral then
        Remove all the ground clauses satisfied by assigning the value of $y_i$ to the corresponding ground atom
    else
        Add $y_i$ to $F$
    end if
end for
repeat
    Remove the last item $y_i$ in $F$
    Compute the sum $w^+$ of the weights of the unsatisfied clauses where $y_i$ appears as a positive literal
    Compute the sum $w^-$ of the weights of the unsatisfied clauses where $y_i$ appears as a negative literal
    if $w^+ > w^-$ then
        $y_i \leftarrow 1$
    else
        $y_i \leftarrow 0$
    end if
    Remove all the ground clauses satisfied by assigning the value of $y_i$ to the corresponding ground atom
until $F$ is empty
return $y$


In OP5, $C^+$ and $C^-$ are the sets of clauses with positive and negative weights respectively. This ILP problem
can be simplified by not introducing an auxiliary variable $z_j$ for unit clauses, where we can use the variable
$y_i$ directly. This reduces the problem considerably, since ground MLNs typically contain many unit clauses
(Alchemy combines all the non-recursive clauses containing the query atom into a unit clause whose weight
is the sum of all the clauses’ weights). Note that our mapping from a ground MLN to an ILP problem is
slightly different from the one presented by Riedel (2008), which generates two sets of constraints for every
ground clause: one when the clause is satisfied and one when it is not. For a clause with positive weight,
our mapping only generates a constraint when the clause is satisfied; and for a clause with negative weight,
the mapping only imposes constraints when the clause is unsatisfied. The final ILP problem has the same
solution as the one in (Riedel, 2008), but it has fewer constraints since our mapping does not generate
unnecessary constraints. We then relax the integer constraints $y_i, z_j \in \{0, 1\}$ to linear constraints
$y_i, z_j \in [0, 1]$ to obtain an LP-relaxation of the MPE problem.
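
The mapping from ground clauses to an ILP can be illustrated with a small sketch. The Python code below is only an illustration under stated assumptions, not the implementation used in the experiments: the clause encoding `(weight, pos, neg)`, the function names `build_ilp`, `feasible`, and `solve_ilp_brute_force`, and the brute-force enumeration (standing in for a real LP or ILP solver) are all hypothetical choices made for the example.

```python
from itertools import product

INF = float("inf")

def build_ilp(clauses):
    """Map ground clauses to an ILP: maximize sum_j obj[j] * z[j].

    clauses: list of (weight, pos, neg), where pos/neg hold the indices of the
    atoms appearing as positive/negative literals (no atom repeated in a clause).
    Each constraint (a, j, rhs) encodes  sum_i a[i]*y[i] - z[j] >= rhs,
    with j = None when no auxiliary variable is involved.
    """
    obj, cons = [], []
    for w, pos, neg in clauses:
        a = {i: 1 for i in pos}
        a.update({i: -1 for i in neg})
        if w == INF:                       # hard clause: must be satisfied
            cons.append((a, None, 1 - len(neg)))
        elif w > 0:                        # z_j can be 1 only if the clause is satisfied
            obj.append(w)
            cons.append((a, len(obj) - 1, -len(neg)))
        elif w < 0:                        # z_j can be 1 only if the clause is unsatisfied
            obj.append(-w)
            j = len(obj) - 1
            cons += [({i: -1}, j, -1) for i in pos]   # 1 - y_i >= z_j
            cons += [({i: 1}, j, 0) for i in neg]     # y_i >= z_j
    return obj, cons

def feasible(cons, y, z):
    """Check every linear constraint for a binary assignment (y, z)."""
    return all(
        sum(c * y[i] for i, c in a.items()) - (z[j] if j is not None else 0) >= rhs
        for a, j, rhs in cons
    )

def solve_ilp_brute_force(n_atoms, clauses):
    """Enumerate all binary (y, z); only viable for tiny illustrative problems."""
    obj, cons = build_ilp(clauses)
    best = (-INF, None)
    for y in product((0, 1), repeat=n_atoms):
        for z in product((0, 1), repeat=len(obj)):
            if feasible(cons, y, z):
                value = sum(c * zj for c, zj in zip(obj, z))
                best = max(best, (value, y))
    return best

# Toy ground MLN over atoms 0..2:
# (y0 v y1) hard, (!y0 v y2) w=2, (!y1) w=1.5, (y1 v !y2) w=-1.
clauses = [(INF, [0, 1], []), (2.0, [2], [0]), (1.5, [], [1]), (-1.0, [1], [2])]
value, y = solve_ilp_brute_force(3, clauses)
```

Note that the ILP objective differs from the sum of the weights of the satisfied clauses by a constant (the sum of the negative weights), so both are maximized by the same assignment.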

This LP problem can be solved by any general LP solver. If the LP solver returns an integral solution,
then it is also the optimal solution to the original ILP problem. In our case, the original ILP problem is
NP-hard, so the LP solver usually returns non-integral solutions. Therefore, the LP solution needs
to be rounded to give an approximate ILP solution. We first tried some of the randomized rounding methods
in (Asano, 2006), but they gave poor results since the LP solution has many fractional components with
value 0.5. We then adapted a rounding procedure called ROUNDUP (Boros & Hammer, 2002), a procedure
for producing an upper bound binary solution for a pseudo-Boolean function, to the case of pseudo-Boolean
functions with linear constraints (Algorithm 2), which we found to work well. In each step, this procedure
picks one fractional component and rounds it to 1 or 0. Hence, this process terminates in at most $n$ steps,
where $n$ is the number of query atoms. Note that due to the dependencies between the variables $y_i$ and
$z_j$ (the linear constraints of the LP problem), this modified ROUNDUP procedure does not guarantee an
improvement in the value of the objective function in each step, unlike the original ROUNDUP procedure
where all the variables are independent.
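
The rounding procedure of Algorithm 2 can be sketched in a few lines of Python. This is a minimal sketch under assumptions of ours: clauses are encoded as `(weight, pos, neg)` with finite weights, "the sum of the unsatisfied clauses" is read as the sum of their weights, and the integrality tolerance `tol` is a hypothetical parameter added because LP solvers return floating-point values.

```python
def _satisfied(clause, i, value):
    """True if assigning `value` to atom i satisfies the clause."""
    _, pos, neg = clause
    return (value == 1 and i in pos) or (value == 0 and i in neg)

def round_up(y, clauses, tol=1e-9):
    """Round a fractional LP solution to a 0/1 assignment.

    y: list of values in [0, 1]; clauses: list of (weight, pos, neg).
    """
    y = list(y)
    active = list(clauses)          # clauses not yet satisfied
    fractional = []                 # the list F of Algorithm 2
    for i, v in enumerate(y):
        if abs(v - round(v)) <= tol:
            y[i] = int(round(v))
            active = [c for c in active if not _satisfied(c, i, y[i])]
        else:
            fractional.append(i)
    while fractional:
        i = fractional.pop()        # remove the last item in F
        w_pos = sum(w for w, pos, _ in active if i in pos)
        w_neg = sum(w for w, _, neg in active if i in neg)
        y[i] = 1 if w_pos > w_neg else 0
        active = [c for c in active if not _satisfied(c, i, y[i])]
    return y
```

Each pass of the `while` loop fixes exactly one fractional component, which matches the bound of at most $n$ rounding steps.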

4.3  Approximation  algorithm  for  the  separation  oracle 

The separation oracle adds an additional term, the loss term, to the objective function. So, if we can represent
the loss as a linear function of the $y_i$ variables of the LP-relaxation, then we can use the above approximation
algorithm to also approximate the separation oracle. In this work, we consider two loss functions. The first
one is the 0/1 loss function, $\Delta_{0/1}(y^T, y)$, where $y^T$ is the true assignment and $y$ is some predicted assignment.
For this loss function, the separation oracle is the same as the MPE inference problem since the loss function
only adds a constant 1 to the objective function. Hence, in this case, to find the most violated constraint,
we can use the LP-relaxation algorithm above or any other MPE inference algorithm. This 0/1 loss makes
the separation oracle problem easier, but it does not scale the margin by how different $y^T$ and $y$ are. It only
requires a unit margin for all assignments $y$ different from the true assignment $y^T$. To address this
problem, we consider a second loss function, the number of misclassified atoms or the Hamming
loss:

\[
\Delta_{Hamming}(y^T, y) = \sum_i [y_i^T \ne y_i]
= \sum_i [(y_i^T = 0 \wedge y_i = 1) \vee (y_i^T = 1 \wedge y_i = 0)]
\]

From the definition, this loss can be represented as a function of the $y_i$'s:

\[
\Delta_{Hamming}(y^T, y) = \sum_{i: y_i^T = 0} y_i + \sum_{i: y_i^T = 1} (1 - y_i)
\]

which is equivalent to adding 1 to the coefficient of $y_i$ if the true value of $y_i$ is 0 and subtracting 1 from the
coefficient of $y_i$ if the true value of $y_i$ is 1. So we can use the LP-relaxation algorithm above to approximate
the separation oracle with this Hamming loss function. Another possible loss function is the $F_1$ loss, which is
equal to $1 - F_1$. Unfortunately, this loss is a non-linear function, so we cannot use the above approach to
optimize it. Developing algorithms for optimizing or approximating this loss function is an area for future
work.
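
The linear form of the Hamming loss can be checked with a short sketch. The function names below are hypothetical; the code only illustrates how the loss turns into a constant plus per-atom coefficient changes that could be added to the LP objective.

```python
def hamming_loss(y_true, y):
    """Number of atoms whose predicted value differs from the true value."""
    return sum(1 for t, p in zip(y_true, y) if t != p)

def hamming_loss_terms(y_true):
    """Return (constant, coeffs) with loss(y) = constant + sum_i coeffs[i] * y[i].

    An atom whose true value is 0 contributes +1 to the coefficient of y_i;
    an atom whose true value is 1 contributes -1 (and +1 to the constant).
    """
    constant = sum(1 for t in y_true if t == 1)
    coeffs = [1 if t == 0 else -1 for t in y_true]
    return constant, coeffs
```

The constant term does not affect which assignment maximizes the loss-augmented objective, so only the coefficient changes matter for the separation oracle.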

4.4  Experimental  Evaluation 

This section presents experiments comparing the max-margin weight learner to the weight learners in section
3 and the PSCG algorithm.

4.4.1  Datasets 

Besides the Alzheimer’s datasets described in section 3.3.1, we also ran experiments on two other large,
real-world datasets: WebKB for collective web-page classification, and CiteSeer for bibliographic citation
segmentation.



The WebKB dataset (Slattery & Craven, 1998) consists of labeled web pages from the computer science
departments of four universities. Different versions of this data have been used in previous work. To make a
fair comparison, we used the version from (Lowd & Domingos, 2007), which contains 4,165 web pages and
10,935 web links. Each page is labeled with a subset of the categories: course, department, faculty, person,
professor, research project, and student. The goal is to predict these categories from the words and links on
the web pages. We used the same simple MLN from (Lowd & Domingos, 2007), which only has clauses
relating words to page classes, and page classes to the classes of linked pages:

Has(+word, page) => PageClass(+class, page)

¬Has(+word, page) => PageClass(+class, page)

PageClass(+c1, p1) ∧ Linked(p1, p2) => PageClass(+c2, p2)

The plus notation creates a separate clause for each pair of word and page class, and for each pair of classes.
The final MLN consists of 10,891 clauses, and a weight must be learned for each one. After grounding, each
department results in an MLN with more than 100,000 ground clauses and 5,000 query atoms in a complex
network. This also results in a large LP-relaxation problem for MPE inference.
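
The effect of the plus notation can be illustrated with a toy expansion. The sketch below is an assumption-laden illustration, not Alchemy's implementation: the function `expand_plus`, the string-based clause templates, and the example constants are all hypothetical.

```python
import re
from itertools import product

def expand_plus(template, domains):
    """Return one clause per combination of constants for the '+'-marked variables.

    domains maps a variable name to its list of constants; only variables that
    occur as '+name' in the template are expanded.
    """
    plus_vars = [v for v in domains if re.search(r"\+" + re.escape(v) + r"\b", template)]
    clauses = []
    for combo in product(*(domains[v] for v in plus_vars)):
        clause = template
        for var, const in zip(plus_vars, combo):
            clause = re.sub(r"\+" + re.escape(var) + r"\b", const, clause)
        clauses.append(clause)
    return clauses

# Hypothetical toy vocabulary and class set.
words = ["Exam", "Lab"]
classes = ["Course", "Faculty", "Student"]
word_clauses = expand_plus("Has(+word,page) => PageClass(+class,page)",
                           {"word": words, "class": classes})
link_clauses = expand_plus("PageClass(+c1,p1) ^ Linked(p1,p2) => PageClass(+c2,p2)",
                           {"c1": classes, "c2": classes})
```

With two words and three classes, the first template yields six clauses and the link template yields nine, each of which would receive its own weight.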

For CiteSeer, we used the dataset and MLN used in (Poon & Domingos, 2007). The dataset has 1,563
citations and each of them is segmented into three fields: Author, Title and Venue. The dataset has four
disconnected segments corresponding to four different research topics. We used the simplest MLN in (Poon
& Domingos, 2007), which is the isolated segmentation model. Despite its simplicity, after grounding, this
model results in a large network with more than 30,000 query atoms and 110,000 ground clauses.

All the datasets and MLNs can be found at the Alchemy website.

4.4.2  Methodology 

For the max-margin weight learner, we used a simple process for selecting the value of the $C$ parameter. For
each train/test split, we trained the algorithm with five different values of $C$: 1, 10, 100, 1000, and 10000,
then selected the one which gave the highest average $F_1$ score on training. The $\epsilon$ parameter was set to 0.001.
To solve the QP problems in Algorithm 1 and the LP problems in the LP-relaxation MPE inference, we used
the Mosek solver. The PSCG algorithm was carefully tuned by its author. For MC-SAT, we used the
default setting, 100 burn-in and 1000 sampling iterations, and predicted that an atom is true iff its probability
is at least 0.5.

For the Alzheimer’s datasets, we used the same experimental setup mentioned in section 3.3.2. On the
WebKB and CiteSeer datasets, we ran four-fold cross-validation (i.e. leave one university/topic out).

We used $F_1$, the harmonic mean of recall and precision, to measure the performance of each algorithm on
the WebKB and CiteSeer datasets. This is the standard evaluation metric in multi-class text categorization
and information extraction.
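
For reference, the $F_1$ measure over a set of query atoms can be computed as in the following sketch. The function name and the convention of returning 0 when there are no positive atoms at all are assumptions of ours; the text does not specify how that corner case is handled or how scores are aggregated across classes.

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN) over binary atom labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # assumed convention: no true and no predicted positives
    return 2 * tp / denominator
```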

4.4.3  Results  and  Discussion 

Tables 8 and 10 present the performance of different systems on the WebKB and CiteSeer datasets. Each
system is named by the weight learner used, the loss function used in training, and the inference algorithm
used in testing. For the max-margin (MM) learner with margin rescaling, the inference used in training is the
loss-augmented version of the one used in testing. For example, MM-$\Delta_{Hamming}$-LPRelax is the max-margin
weight learner using the loss-augmented (Hamming loss) LP-relaxation MPE inference algorithm in training
and the LP-relaxation MPE inference algorithm in testing.

Alchemy website: http://alchemy.cs.washington.edu
Mosek solver: http://www.mosek.com/

Table 8: $F_1$ scores on WebKB

System                            Cornell   Texas   Washington   Wisconsin   Average
PSCG-MCSAT                        0.418     0.298   0.577        0.568       0.465
PSCG-LPRelax                      0.420     0.310   0.588        0.575       0.474
MM-$\Delta_{0/1}$-MaxWalkSAT      0.150     0.162   0.122        0.122       0.139
MM-$\Delta_{0/1}$-LPRelax         0.282     0.372   0.675        0.521       0.462
MM-$\Delta_{Hamming}$-LPRelax     0.580     0.451   0.715        0.659       0.601

Table 9: $F_1$ scores of different inference algorithms on WebKB

System                            Cornell   Texas   Washington   Wisconsin   Average
PSCG-MCSAT                        0.418     0.298   0.577        0.568       0.465
PSCG-MaxWalkSAT                   0.161     0.140   0.119        0.129       0.137
PSCG-LPRelax                      0.420     0.310   0.588        0.575       0.474
MM-$\Delta_{Hamming}$-MCSAT       0.470     0.370   0.573        0.481       0.473
MM-$\Delta_{Hamming}$-MaxWalkSAT  0.185     0.184   0.150        0.154       0.168
MM-$\Delta_{Hamming}$-LPRelax     0.580     0.451   0.715        0.659       0.601

Table 8 shows that the model trained using MaxWalkSAT has very low predictive accuracy. This result
is consistent with the result presented in (Riedel, 2008), which also found that the MPE solution found by
MaxWalkSAT is not very accurate. Using the proposed LP-relaxation MPE inference (the MM-$\Delta_{0/1}$-LPRelax
system) improves the $F_1$ score from 0.139 to 0.462. The best system is then obtained by rescaling the
margin and training with our loss-augmented LP-relaxation MPE inference, which is the only difference
between MM-$\Delta_{Hamming}$-LPRelax and MM-$\Delta_{0/1}$-LPRelax. MM-$\Delta_{Hamming}$-LPRelax achieves the best $F_1$
score (0.601), which is much higher than the 0.465 $F_1$ score obtained by the current best discriminative
weight learner for MLNs, PSCG-MCSAT.

Table 9 compares the performance of the proposed LP-relaxation MPE inference algorithm against MC-SAT
and MaxWalkSAT on the best trained models by PSCG and MM on the WebKB dataset. In both cases,
the LP-relaxation MPE inference achieves much better $F_1$ scores than those of MC-SAT and MaxWalkSAT.
This demonstrates that the approximate MPE solution found by the LP-relaxation algorithm is much more
accurate than the one found by the MaxWalkSAT algorithm. The fact that the performance of the LP-relaxation
is higher than that of MC-SAT shows that in collective classification it is better to use the MPE
solution as the prediction than the marginal prediction.

For the WebKB dataset, there are other results reported in previous work, such as those in (Taskar et al.,
2003), but those results cannot be directly compared to our results since we use a different version of the
dataset and test on a more complicated task (a page can have multiple labels, not just one).

On the CiteSeer results presented in Table 10, the performance of the max-margin method is very close
to that of PSCG. However, its performance is much more stable. Table 11 shows the performance of MM
weight learners and PSCG with different parameter values by varying the $C$ value for MM and the number
of iterations for PSCG. The best number of iterations for PSCG is 9 or 10. In principle, we should run PSCG
until it converges to get the optimal weight vector. However, in this case, the performance of PSCG drops
drastically on both training and testing after a certain number of iterations. For example, from Table 11 we
can see that at 10 iterations PSCG achieves the best $F_1$ score of 0.939, but after 15 iterations, its $F_1$ score
drops to 0.861, which is much worse than those of the max-margin weight learners. Moreover, if we let it
run until 100 iterations, then its $F_1$ score is only 0.656. On the other hand, the performance of MM only
varies slightly with different values of $C$, and we do not need to tune the number of iterations of MM. On
this dataset, (Poon & Domingos, 2007) achieved an $F_1$ score of 0.944 with the same MLN by using a version
of the voted perceptron algorithm called Contrastive Divergence (CD) (Hinton, 2002) to learn the weights.
However, the performance of the CD algorithm is very sensitive to the learning rate (Lowd & Domingos,
2007), which requires a very careful tuning process to learn a good model.

Table 10: $F_1$ scores on CiteSeer

System                            Constraint   Face    Reasoning   Reinforcement   Average
PSCG-MCSAT                        0.937        0.914   0.931       0.975           0.939
MM-$\Delta_{Hamming}$-LPRelax     0.933        0.922   0.924       0.958           0.934

Table 11: $F_1$ scores on CiteSeer with different parameter values

System                                Constraint   Face    Reasoning   Reinforcement   Average
PSCG-MCSAT-5                          0.852        0.844   0.836       0.923           0.864
PSCG-MCSAT-10                         0.937        0.914   0.931       0.973           0.939
PSCG-MCSAT-15                         0.878        0.896   0.780       0.891           0.861
PSCG-MCSAT-20                         0.850        0.859   0.710       0.784           0.801
PSCG-MCSAT-100                        0.658        0.697   0.600       0.668           0.656
MM-$\Delta_{Hamming}$-LPRelax-1       0.933        0.922   0.924       0.955           0.934
MM-$\Delta_{Hamming}$-LPRelax-10      0.926        0.922   0.925       0.955           0.932
MM-$\Delta_{Hamming}$-LPRelax-100     0.926        0.922   0.925       0.954           0.932
MM-$\Delta_{Hamming}$-LPRelax-1000    0.931        0.918   0.925       0.958           0.933
MM-$\Delta_{Hamming}$-LPRelax-10000   0.932        0.922   0.919       0.968           0.935

Tables 12 and 13 compare the performance of the MM weight learners against some of the systems
described in section 3 for the case when the transitive clause is included. For the MM weight learner, instead
of adding the transitive clause to the learnt MLNs in testing, we learned the weights in the presence of
the transitive clause since it can handle recursive clauses. In terms of accuracy, the MM weight learner is
slightly worse than the ones proposed in the previous section. However, the 1-norm MM weight learner
(MM-L1-LPRelax) produces a very compact model with high accuracy, with roughly 50 clauses or fewer
(Table 13), while the models learnt by other systems have thousands of clauses.

Regarding training time, the max-margin weight learner is comparable to other learners. On the
Alzheimer’s datasets, it took less than 100 iterations to find the optimal weights, which resulted in a few
minutes of training. For the WebKB and CiteSeer datasets, the numbers of training iterations are about 200
and 50 respectively, which takes a few hours of training for WebKB and less than an hour for CiteSeer.



Table 12: Average predictive accuracies and standard deviations on Alzheimer’s datasets with transitive
clause added

Data set            Aleph        Aleph++      Aleph++      Aleph        Aleph++      Aleph++
                    ExactL2      ExactL2      ExactL1      MM-LPRelax   MM-LPRelax   MM-L1-LPRelax
Alzheimer amine     87.0 ± 3.3   91.7 ± 3.5   90.5 ± 3.6   87.0 ± 2.2   89.2 ± 2.9   88.8 ± 3.0
Alzheimer toxic     88.8 ± 4.8   91.4 ± 3.6   91.9 ± 4.1   88.5 ± 4.2   90.8 ± 3.6   91.6 ± 4.3
Alzheimer acetyl    84.1 ± 3.1   88.7 ± 2.1   87.6 ± 2.7   86.3 ± 2.8   88.3 ± 2.9   87.9 ± 2.8
Alzheimer memory    76.5 ± 3.5   81.3 ± 4.8   81.3 ± 4.1   79.1 ± 3.0   81.5 ± 4.2   80.7 ± 4.0

Table 13: Average number of clauses learned on Alzheimer’s datasets

Data set            Aleph        Aleph++      Aleph++      Aleph++      Aleph++      Aleph++
                                              ExactL2      ExactL1      MM-LPRelax   MM-L1-LPRelax
Alzheimer amine     10           7061         5070         3477         6981         35
Alzheimer toxic     9            2034         1194         747          2034         25
Alzheimer acetyl    12           8662         5427         2433         8621         51
Alzheimer memory    11           6524         4250         2471         6297         31

4.5  Related  Work 

Our work is related to various previous projects. Among them, M3N (Taskar et al., 2003) is probably
the most related. It is a special case of structural SVMs where the feature function $\Psi(x, y)$ is represented
by a Markov network. When the Markov network can be triangulated and the loss function can be linearly
decomposed, the original exponentially-sized QP can be reformulated as a polynomially-sized QP
(Taskar et al., 2003). Then, the polynomially-sized QP can be solved by general QP solvers (Anguelov,
Taskar, Chatalbashev, Koller, Gupta, Heitz, & Ng, 2005), decomposition methods (Taskar et al., 2003),
extragradient methods (Taskar, Lacoste-Julien, & Jordan, 2006), or exponentiated gradient methods (Collins,
Globerson, Koo, Carreras, & Bartlett, 2008). As mentioned in (Taskar et al., 2003), these methods can also
be used when the graph cannot be triangulated, but the algorithms only yield approximate solutions like our
approach. However, these algorithms are restricted to the cases where a polynomially-sized reformulation
exists (Joachims et al., 2009). Consequently, in this work we used the general cutting plane algorithm, which
imposes no restrictions on the representation. The ground MLN can be any kind of graph. On the other
hand, since an MLN is a template for constructing Markov networks (Richardson & Domingos, 2006), the
proposed model, M3LN, can also be seen as a template for constructing M3Ns. Hence, when the ground
MLN can be triangulated and the loss is a linearly decomposable function, the algorithms developed for
M3Ns can be applied. Our work is also closely related to Relational Markov Networks (RMNs) (Taskar
et al., 2002). However, by using MLNs, M3LNs are more powerful than RMNs in terms of representation
(Richardson & Domingos, 2006). Besides, the objectives of M3LNs and RMNs are different. The former tries
to maximize the margin between the true assignment and other competing assignments, while the latter tries to
maximize the conditional likelihood of the true assignment. Another related system is Rumble (Rückert
& Kramer, 2007), a margin-based approach to first-order rule learning. In that work, the goal is to find
a set of weighted rules that maximizes a quantity called margin minus variance. However, unlike M3LNs,
Rumble only applies to independent binary classification problems and is unable to perform structured
prediction or collective classification. In terms of applying the general structural SVM framework to a specific
representation, our work is related to the work in (Szummer, Kohli, & Hoiem, 2008), which used CRFs as
the representation and graph cuts as the inference algorithm. In the context of discriminative learning, our
work is related to previous work on discriminative training for MLNs (Singla & Domingos, 2005; Lowd &
Domingos, 2007; Huynh & Mooney, 2008; Biba et al., 2008). We have mentioned some of them (Singla
& Domingos, 2005; Lowd & Domingos, 2007; Huynh & Mooney, 2008) in previous sections. The main
difference between the work in (Biba et al., 2008) and ours is that we assume the structure is given and apply
the max-margin framework to learn the weights, while (Biba et al., 2008) tries to learn a structure that maximizes
the conditional likelihood of the data. Extending the max-margin framework to structure learning is an area
for future work.

4.6  Summary 

We have presented a max-margin weight learning method for MLNs based on the framework of structural
SVMs. It resulted in a new model, M3LN, that has the representational expressiveness of MLNs and the
predictive performance of SVMs. M3LNs can be trained to optimize different performance measures
depending on the needs of the application. To train the proposed model, we developed a new approximation
algorithm for loss-augmented MPE inference in MLNs based on LP-relaxation. The experimental results
showed that the new max-margin learner generally has better or equally good but more stable predictive
accuracy (as measured by $F_1$) than the current best discriminative MLN weight learner.



5  Proposed  Research 

5.1  Improving  the  predictive  performance 

5.1.1  Revising  MLNs 

In the previous section, we looked at the problem of learning weights for a given set of clauses provided
as the input. We assumed that the input clauses are correct and only learnt weights for them. However,
these clauses, usually provided by domain experts, may be too general or too specific, and thus not good
for prediction. In fact, in many cases, when we look at the learnt MLNs, there are clauses whose weights
are very small (nearly zero), which do not have any predictive power. Therefore, it would be better if we
could revise these clauses. First-order theory revision is a well-studied problem (Wrobel, 1996). However,
revising a first-order probabilistic model like an MLN is a new problem and only a few works have looked at
it (Revoredo & Zaverucha, 2002; Paes, Revoredo, Zaverucha, & Costa, 2005; Mihalkova, Huynh,
& Mooney, 2007). All of the existing work is based on FORTE (Richards & Mooney, 1995), a successful
first-order theory revision system. Among them, (Mihalkova et al., 2007) is the one that deals with revising
MLNs in the context of transfer learning. Based on this work, we propose the following procedure to revise
an MLN in a discriminative manner:

1. Step 1: This step is similar to the Self-Diagnosis step of RTAMAR (Mihalkova et al., 2007), which
generates revision points for the next steps. However, we make some modifications. First, instead of
transferring the weights from the source MLN to the target MLN, we learn the weights for the input
MLN by using a discriminative weight learner. The clauses with very small weights are the candidates
for revision. This is different from RTAMAR, where all clauses are inspected by the algorithm. Then,
we set the values of all of the groundings of the query predicates to unknown, and run inference with
the current MLN to find where it makes wrong predictions. For a wrong prediction of a query atom X,
all ground clauses containing X of the candidate clauses will be inspected and classified into two bins:
[Relevant] and [Irrelevant]. The [Relevant] bin consists of clauses whose premises are satisfied and
the [Irrelevant] bin contains clauses whose premises are not satisfied. In comparison to RTAMAR,
we combine the [Relevant; Good] and [Relevant; Bad] bins into the [Relevant] bin, and likewise the
other two bins, [Irrelevant; Good] and [Irrelevant; Bad], into the [Irrelevant] bin. The reason is that
assigning a negative weight to a clause in the bad bin will make it become a good one and vice versa. So, in
terms of predictive power, the clauses in the good bin and bad bin are the same, and it is the job of the
weight learner to assign the correct weights to these clauses. The clauses in the [Relevant] bin are the
ones that are too general, since the premises are satisfied but the weights are nearly zero, and the ones in the
[Irrelevant] bin are too specific, since their premises are not satisfied. Then we count how many times
a candidate clause falls into each of the two bins by diagnosing all the wrong predictions. Finally, if a
candidate clause is placed in the [Relevant] or [Irrelevant] bin more than $p$ percent of the time, it is
marked for lengthening or shortening respectively. The role of the threshold $p$ is to ignore the random
errors introduced by the inference algorithm. The value of $p$ will be determined experimentally.

2. Step 2: To revise the marked clauses, we can use the top-down beam search approach in RTAMAR,
a stochastic local search (Paes, Zaverucha, & Costa, 2007), or a bottom-up approach (Duboc, Paes,
& Zaverucha, 2008). The weights of the candidate clauses can be learnt in an inexpensive way by
keeping the weights of other clauses fixed. Then we can run inference and score the candidate clauses
by some discriminative metric such as accuracy or CLL. This step terminates when the
algorithm cannot find any new clause that improves the score of the model.
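
The binning and thresholding part of Step 1 can be sketched as follows. This is a hypothetical illustration only: the function `diagnose`, the `(clause_id, premises_satisfied)` input format, and the choice to measure "$p$ percent of the time" relative to the number of inspected groundings of each clause are assumptions of ours, since the procedure is still a proposal.

```python
from collections import Counter

def diagnose(inspections, p):
    """Mark candidate clauses for lengthening or shortening.

    inspections: iterable of (clause_id, premises_satisfied) pairs, one per
    ground clause inspected for a wrong prediction.
    p: threshold in percent.
    """
    relevant, irrelevant = Counter(), Counter()
    for clause_id, premises_satisfied in inspections:
        (relevant if premises_satisfied else irrelevant)[clause_id] += 1
    marks = {}
    for clause_id in set(relevant) | set(irrelevant):
        total = relevant[clause_id] + irrelevant[clause_id]
        if 100.0 * relevant[clause_id] / total > p:
            marks[clause_id] = "lengthen"   # [Relevant]: too general
        elif 100.0 * irrelevant[clause_id] / total > p:
            marks[clause_id] = "shorten"    # [Irrelevant]: too specific
    return marks
```

A clause that falls into both bins about equally often stays unmarked, which is how the threshold filters out noise from approximate inference.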



5.1.2 Optimizing non-linear performance metrics


In section 4.3, we have shown how to optimize a linearly decomposable loss function such as the Hamming
loss. However, there are other popular performance metrics (e.g. $F_1$, ROCArea) that are not linearly
decomposable. Thus, the loss functions corresponding to these metrics (for example the $F_1$ loss $= 1 - F_1$) are also
non-linear. Finding the most violated constraint for these loss functions is a much harder problem. One
simple approximation is to find the MPE solution or the N-best MPE solutions (Yanover & Weiss, 2003), then
check whether any of them violates the constraint. If no violated constraint is found, then the current
weight vector is taken as a good solution. This approach is not guaranteed to find the optimal weights, but
it may find good ones.

The $F_1$ loss can be written as:

\[
\Delta_{F_1}(y^T, y) = 1 - \frac{2TP}{2TP + FP + FN} = \frac{FP + FN}{2TP + FP + FN}
\]

where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false
negatives. The quantities $TP$, $FP$, and $FN$ can be represented as linear functions of the query variables $y_i$:

\[
TP = \sum_{i: y_i^T = 1} y_i, \qquad
FP = \sum_{i: y_i^T = 0} y_i, \qquad
FN = \sum_{i: y_i^T = 1} (1 - y_i)
\]

Thus, the $F_1$ loss is a linear-fractional function of the variables $y_i$. Adding this linear-fractional loss to
the objective function of the MPE problem (OP5), we have a fractional programming problem. So we
may use techniques in fractional programming (Stancu-Minasian, 1997) such as Dinkelbach’s algorithm
(Dinkelbach, 1967) to find the most violated constraint for the $F_1$ loss.
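
To make the idea concrete, the following sketch applies Dinkelbach's iteration to a toy loss-augmented objective. It is an illustration under strong assumptions, not a proposed implementation: the inner maximization is done by brute force over all assignments (a real system would need an approximate solver), the per-atom `scores` are a hypothetical stand-in for $w^T n(x, y)$, and the truth is assumed to contain at least one positive atom so that the denominator $2TP + FP + FN$ is positive.

```python
from itertools import product

def f1_loss_parts(y_true, y):
    """Return (FP + FN, 2TP + FP + FN), the numerator and denominator of the F1 loss."""
    tp = sum(1 for t, p in zip(y_true, y) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y) if t == 1 and p == 0)
    return fp + fn, 2 * tp + fp + fn

def dinkelbach(candidates, num, den, tol=1e-9, max_iter=100):
    """Maximize num(y) / den(y) over a finite candidate list, assuming den(y) > 0."""
    best = candidates[0]
    lam = num(best) / den(best)
    for _ in range(max_iter):
        # Parametric subproblem: maximize num(y) - lam * den(y).
        cand = max(candidates, key=lambda y: num(y) - lam * den(y))
        if num(cand) - lam * den(cand) <= tol:
            break                      # lam is (numerically) the optimal ratio
        best, lam = cand, num(cand) / den(cand)
    return best, lam

# Toy instance: four query atoms with hypothetical per-atom scores.
y_true = (1, 0, 1, 0)
scores = (0.3, 0.8, -0.5, 0.2)
candidates = list(product((0, 1), repeat=len(y_true)))

def model_score(y):
    return sum(s * v for s, v in zip(scores, y))

def num(y):
    # (score + loss) written over the common denominator 2TP + FP + FN.
    n, d = f1_loss_parts(y_true, y)
    return model_score(y) * d + n

def den(y):
    return f1_loss_parts(y_true, y)[1]

most_violating, objective = dinkelbach(candidates, num, den)
```

Each iteration replaces the ratio by a parametric problem without a fraction, which is the step that would be handed to an MPE-style solver in practice.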

5.2  More  efficient  learning  algorithm 
5.2.1  Online  learning 

So far, we have presented algorithms for learning in the batch setting. However, there are many cases where
this approach becomes computationally expensive, especially when the number of training examples is
huge. For example, in the problem of extracting information from text documents, each document
is an example. To get a high-accuracy model, we usually train on a huge corpus containing thousands
of documents. This makes batch learning costly and inefficient since we need to keep all of the training
examples in memory and run inference on thousands of training examples in each iteration. One efficient
alternative is online learning, which processes the training examples sequentially.

Some of the existing weight learning algorithms for MLNs already have the ability to do online learning.
The structured perceptron algorithm (Singla & Domingos, 2005) and its variant, contrastive divergence
(Lowd & Domingos, 2007), can operate in an online fashion by processing one example at a time and
updating the weights whenever they make a wrong prediction. To handle overfitting, we can use some variants
of the structured perceptron such as the voted perceptron or averaged perceptron (Collins, 2002). However, these
simple perceptron-based algorithms do not look for a solution that has a large margin. There is some existing
work on margin-based online learning for structured prediction (e.g., Crammer, McDonald, & Pereira, 2005;
Crammer, Dekel, Keshet, Shalev-Shwartz, & Singer, 2006; Keshet, Shalev-Shwartz, Singer, & Chazan,
2007; Shalev-Shwartz, 2007; Nathan Ratliff & Zinkevich, 2007). We plan to adapt these algorithms to the
case of MLNs. For example, we can rewrite the M3LN optimization problem (OP4) as follows:


Optimization  Problem  6  (OP6):  Max-Margin  Markov  Logic  Networks 

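$$\min_{w,\ \xi \ge 0} \ \frac{1}{2} w^{T} w + C\xi$$

$$\text{s.t.} \quad w^{T} n(x, y) + \xi \ \ge\ \max_{\hat{y} \in \mathcal{Y}} \{ \Delta(y, \hat{y}) + w^{T} n(x, \hat{y}) \}$$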

This  optimization  problem  can  be  cast  as  an  unconstrained  optimization  problem: 

Optimization  Problem  7  (OP7): 

$$\min_{w} \ \frac{1}{2} w^{T} w + C \left[ \max_{\hat{y} \in \mathcal{Y}} \{ \Delta(y, \hat{y}) + w^{T} n(x, \hat{y}) \} - w^{T} n(x, y) \right]$$


since the inequality constraint in OP6 becomes an equality at the optimal solution. Then we can apply the
online subgradient algorithm for structured prediction (Nathan Ratliff & Zinkevich, 2007) to solve OP7.
To find the subgradients for this problem, we need to solve the loss-augmented inference problem. We have
developed an approximation algorithm for solving this problem in section 4.3.
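
As a rough illustration of one subgradient step on OP7, the sketch below assumes a toy setting that is not taken from the proposal: each query atom has a hypothetical feature vector, n(x, y) is the sum of the feature vectors of atoms set to 1, the loss is the Hamming loss, and loss-augmented inference is done by brute-force enumeration instead of the approximation algorithm of section 4.3. A subgradient of the OP7 objective at w is then w + C [n(x, y*) - n(x, y)], where y* is the loss-augmented maximizer.

```python
from itertools import product

def counts(x, y):
    """Stand-in for n(x, y): sum of the feature vectors of atoms set to 1."""
    dim = len(x[0])
    return [sum(f[k] for f, yi in zip(x, y) if yi == 1) for k in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hamming(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat))

def loss_augmented_argmax(w, x, y):
    """Brute-force argmax over y_hat of Delta(y, y_hat) + w . n(x, y_hat)."""
    return max(product((0, 1), repeat=len(y)),
               key=lambda y_hat: hamming(y, y_hat) + dot(w, counts(x, y_hat)))

def op7_objective(w, x, y, C):
    y_star = loss_augmented_argmax(w, x, y)
    slack = (hamming(y, y_star) + dot(w, counts(x, y_star))
             - dot(w, counts(x, y)))
    return 0.5 * dot(w, w) + C * slack

def subgradient_step(w, x, y, C, eta):
    """One update w <- w - eta * g, with g = w + C * (n(x, y*) - n(x, y))."""
    y_star = loss_augmented_argmax(w, x, y)
    n_star, n_true = counts(x, y_star), counts(x, y)
    return [wk - eta * (wk + C * (ns - nt))
            for wk, ns, nt in zip(w, n_star, n_true)]

# Hypothetical example: three query atoms with two features each.
x = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = (1, 0, 1)
w0 = [0.0, 0.0]
w1 = subgradient_step(w0, x, y, C=1.0, eta=0.1)
```

In an online setting this step would be applied to one training example at a time, typically with a decreasing step size `eta`; the step-size schedule is left open here.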

Thus far, we have only discussed online learning of weights, but a complete online learning system must
also be able to do online structure learning and revision, since errors may come from the structure, in which
case fixing the weights alone is not enough. So we also plan to look at the problem of online structure learning
and revision. In online learning, we need to do structure learning and revision together, since to fix
prediction errors on a given example we may need to revise the current model, learn new clauses
from the example, or do both. This problem is related to the problem of incremental theory refinement
(Mooney, 1992). As pointed out by Mooney (1992), an incremental/online learning system may run into the
problem of “snowballing”: based on a small amount of data, the system makes bad initial changes to the
model; these bad changes may not be fixed in later steps and result in an over-complicated model that hurts
overall performance. So the challenge in online structure learning and revision is to make
good decisions based on only a small amount of data: the current example or a subset of the examples seen
so far. However, there is good news for MLNs. Since MLNs define a probability distribution over possible
worlds or interpretations (De Raedt & Kersting, 2008), their training examples are Herbrand interpretations,
which contain more information than the positive/negative examples used in traditional ILP systems.

5.2.2  Efficient  MPE  and  loss-augmented  MPE  inference  algorithms 

The major weakness of the LP-relaxation inference algorithm presented in section 4.2 is that it operates on
the ground Markov network, i.e., the whole MLN must be fully grounded. Fully instantiating an MLN takes
a lot of time, requires a lot of memory, and becomes impossible when there are many query atoms and the
model contains complex relationships among them, as in entity resolution and joint learning problems
(Singla & Domingos, 2006; Poon, Domingos, & Sumner, 2008). There is some existing work on efficient
inference methods for MLNs that try not to fully ground the whole network, such as lazy inference (Singla
& Domingos, 2006; Poon et al., 2008), cutting plane inference (CPI) (Riedel, 2008), and lifted inference
(Singla & Domingos, 2008). These algorithms exploit different aspects of relational domains and structures
of the ground Markov network. Lazy inference takes advantage of the sparsity of relational domains: most
query atoms are false. CPI utilizes the redundancy of the ground Markov network: predictions based on
local information already satisfy the global constraints. Lifted inference exploits the symmetry of the ground
Markov network: some structures appear multiple times in the network. So we plan to combine the
advantages of these algorithms into a more efficient inference algorithm. For example, we can reduce the size of
the network constructed by the CPI method by taking into account the sparsity of relational domains.
Since most query atoms are false, initially we only need to ground clauses that may make a query atom
become true. These are ground clauses that are unsatisfied (or satisfied if the weight is negative) assuming
that all the query atoms are false. So instead of initializing the partial network with all the groundings of
non-recursive clauses (Riedel, 2008), we only need to consider a subset of them. In the case that the partial
network is still large, we can apply the methods of lifted inference to construct a compressed factor
graph of the network. Then we can run the LP-relaxation inference algorithm on the compressed factor
graph.
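
The selection rule for the initial partial network can be sketched as follows. This is only an illustration under assumed data structures: a ground clause is a pair of a weight and a list of (atom, is_positive) literals, evidence is a dictionary of known truth values, and the atom names are hypothetical. It shows the filtering step only, not CPI or lifted inference themselves.

```python
def is_satisfied(literals, truth):
    """A ground clause is satisfied if at least one of its literals is true."""
    return any(truth[atom] == positive for atom, positive in literals)

def initial_active_clauses(ground_clauses, evidence, query_atoms):
    """Keep clauses that could push a query atom to true when all query
    atoms are assumed false: unsatisfied clauses with positive weight, or
    satisfied clauses with negative weight."""
    truth = dict(evidence)
    truth.update({atom: False for atom in query_atoms})
    active = []
    for weight, literals in ground_clauses:
        satisfied = is_satisfied(literals, truth)
        if (weight > 0 and not satisfied) or (weight < 0 and satisfied):
            active.append((weight, literals))
    return active

# Hypothetical ground clauses; (atom, False) denotes a negated literal.
ground_clauses = [
    (1.5, [("Smokes(A)", False), ("Cancer(A)", True)]),   # Smokes(A) => Cancer(A)
    (1.5, [("Smokes(B)", False), ("Cancer(B)", True)]),   # Smokes(B) => Cancer(B)
    (-0.5, [("Cancer(B)", False)]),                       # negative-weight unit clause
]
evidence = {"Smokes(A)": True, "Smokes(B)": False}
query_atoms = ["Cancer(A)", "Cancer(B)"]

active = initial_active_clauses(ground_clauses, evidence, query_atoms)
```

In this toy instance the second clause is already satisfied by the evidence and carries a positive weight, so it is left out of the initial network.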

The above method can also be used for solving the loss-augmented inference problem if the loss function
is decomposable into a set of ground clauses. For example, the Hamming loss can be represented by adding
a unit clause with weight 1 or -1 for every false or true grounding of the query predicates, respectively.
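
A small check of this encoding, written as a sketch with hypothetical helper names: each query atom contributes one positive unit clause whose weight is +1 if the atom is false in the true labeling and -1 if it is true. The total weight of satisfied unit clauses then equals the Hamming loss minus the number of true atoms, a constant that does not change the maximizer.

```python
from itertools import product

def hamming_unit_clauses(y_true):
    """One unit clause (weight, atom index) per query atom:
    weight +1 if the atom is false in y_true, -1 if it is true."""
    return [(-1 if t == 1 else 1, i) for i, t in enumerate(y_true)]

def clause_score(unit_clauses, y_hat):
    """Total weight of satisfied unit clauses (clause i holds when y_hat[i] == 1)."""
    return sum(w for w, i in unit_clauses if y_hat[i] == 1)

def hamming(y_true, y_hat):
    return sum(a != b for a, b in zip(y_true, y_hat))

y_true = (1, 0, 1, 0)
clauses = hamming_unit_clauses(y_true)
offset = sum(y_true)  # constant shift: number of true query atoms

# The two quantities agree on every possible labeling.
agree = all(hamming(y_true, y_hat) == clause_score(clauses, y_hat) + offset
            for y_hat in product((0, 1), repeat=len(y_true)))
```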

5.3  Experiments  on  additional  problems 

We plan to apply M3LNs to more complex problems where there are complicated relationships between
the input and output variables and among the output ones. One such problem is joint learning in natural
language processing. For example, consider the problem of jointly recognizing entities and relations in
sentences (Roth & Yih, 2002): first-order logic provides a natural way to express the patterns for identifying
entities and relations from local information, such as lexical and syntactic information, and also the global
relationships between entity types and relations. Then we can use the max-margin weight learner
to learn weights for these clauses. Another interesting joint learning problem is “scene understanding”
in computer vision (Li & Li, 2007; Heitz, Gould, Saxena, & Koller, 2008; Li, Socher, & Fei-Fei, 2009),
where the key problem is to simultaneously recognize the overall scene and the component objects of a
given image. To achieve high performance, besides the visual features, one needs to take into account
the relationships between objects and between objects and scenes. Learning these types of relationships is
what a statistical relational learning model like MLNs is good for. The challenge of joint learning is to be
able to handle a large amount of complicated data. The online learning and efficient inference algorithms
described in previous sections will help to address this challenge. Besides, we also want to look at the activity
recognition problem (Tran & Davis, 2008), which also requires the ability to handle complicated relations
and interactions between objects.

6  Conclusions 

Learning from noisy structured/relational data is one of the key problems in machine learning. Markov logic
networks, a formalism that combines the expressivity of first-order logic with the flexibility of probabilistic
reasoning, are a powerful model for handling this kind of data. Discriminative learning is an important research
problem in MLNs since most learning problems in relational data are discriminative. In this proposal, we
have presented two new discriminative learning algorithms for MLNs. The first algorithm is a discriminative
structure and weight learner for MLNs with non-recursive clauses, and the second one is a max-margin
weight learner for MLNs. For future work, our short-term goal is to develop a more efficient MPE inference
algorithm for MLNs and apply our max-margin weight learner to more complex problems that contain
complicated relationships between input and output variables and among the outputs, such as joint learning
problems in natural language processing. In the longer term, our plan is to develop more efficient learning
algorithms through online learning and algorithms that revise both the clauses and their weights to improve
predictive performance.


Acknowledgments 

We thank Niels Landwehr for helping us set up the experiment with Aleph. We also thank Daniel Lowd
and Hoifung Poon for useful discussions and help with the experiments. This research is sponsored by
DARPA and managed by AFRL under contract FA8750-05-2-0283. The project is also partly supported by
ARO grant W911NF-08-1-0242. Most of the experiments were run on the Mastodon Cluster, provided by
NSF Grant EIA-0303609. The first author also thanks the Vietnam Education Foundation (VEF) for its
sponsorship.


References 


Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. In Ghahramani, Z.
(Ed.), Proceedings of 24th International Conference on Machine Learning (ICML-2007), pp. 33-40,
Corvallis, OR.

Anguelov, D., Taskar, B., Chatalbashev, V., Koller, D., Gupta, D., Heitz, G., & Ng, A. (2005). Discriminative
learning of Markov random fields for segmentation of 3D scan data. In Proceedings of the 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2,
pp. 169-176.

Asano, T. (2006). An improved analysis of Goemans and Williamson’s LP-relaxation for MAX SAT.
Theoretical Computer Science, 354(3), 339-353.

Asano, T., & Williamson, D. P. (2002). Improved approximation algorithms for MAX SAT. Journal of
Algorithms, 42(1), 173-202.

Biba, M., Ferilli, S., & Esposito, F. (2008). Discriminative structure learning of Markov logic networks.
In Proceedings of the 18th international conference on Inductive Logic Programming (ILP’08), pp.
59-76, Prague, Czech Republic. Springer-Verlag.

Boros, E., & Hammer, P. L. (2002). Pseudo-Boolean optimization. Discrete Applied Mathematics, 123(1-3),
155-225.

Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector
machines. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98),
pp. 82-90, Madison, Wisconsin, USA. Morgan Kaufmann Publishers Inc.

Collins,  M.  (2002).  Discriminative  training  methods  for  hidden  Markov  models:  Theory  and  experiments 
with  perceptron  algorithms.  In  Proceedings  of  the  2002  Conference  on  Empirical  Methods  in  Natural 
Language  Processing  (EMNLP-02),  Philadelphia,  PA. 

Collins,  M.  (2004).  Parameter  estimation  for  statistical  parsing  models:  Theory  and  practice  of  distribution- 
free  methods.  In  Harry  Bunt,  J.  C.,  &  Satta,  G.  (Eds.),  New  Developments  in  Parsing  Technology. 
Kluwer. 

Collins, M., Globerson, A., Koo, T., Carreras, X., & Bartlett, P. L. (2008). Exponentiated gradient algorithms
for conditional random fields and max-margin Markov networks. Journal of Machine Learning
Research, 9, 1775-1822.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive
algorithms. Journal of Machine Learning Research, 7, 551-585.

Crammer, K., McDonald, R., & Pereira, F. (2005). Scalable large-margin online learning for structured
classification. Tech. rep., Department of Computer and Information Science, University of Pennsylvania.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press.

Cussens, J. (2007). Logic-based formalisms for statistical relational learning. In Getoor, L., & Taskar, B.
(Eds.), Introduction to Statistical Relational Learning, pp. 269-290. MIT Press, Cambridge, MA.

Davis, J., Burnside, E. S., de Castro Dutra, I., Page, D., & Costa, V. S. (2005). An integrated approach to
learning Bayesian networks of rules. In Proceedings of the 16th European Conference on Machine
Learning (ECML-05), pp. 84-95.


Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In
Proceedings of 23rd International Conference on Machine Learning (ICML-2006), pp. 233-240.

De Raedt, L., & Kersting, K. (2008). Probabilistic inductive logic programming. In Raedt, L. D., Frasconi,
P., Kersting, K., & Muggleton, S. (Eds.), Probabilistic Inductive Logic Programming, Vol. 4911 of
Lecture Notes in Computer Science, pp. 1-27. Springer.

Dehaspe,  L.  (1997).  Maximum  entropy  modeling  with  clausal  constraints.  In  Dzeroski,  S.,  &  Lavrac,  N. 
(Eds.),  Proceedings  of  the  7th  International  Workshop  on  Inductive  Logic  Programming,  pp.  109-124. 

Dinkelbach, W. (1967). On nonlinear fractional programming. Management Science, 13(7), 492-498.

Duboc, A. L., Paes, A., & Zaverucha, G. (2008). Using the bottom clause and mode declarations on FOL
theory revision from examples. In Proceedings of the 18th International Conference on Inductive
Logic Programming (ILP-2008), pp. 91-106.

Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation with generalized
regularization and an application to species distribution modeling. Journal of Machine Learning
Research, 8, 1217-1260.

Dzeroski, S. (1991). Handling noise in inductive logic programming. Master’s thesis, Faculty of Electrical
Engineering and Computer Science, University of Ljubljana.

Dzeroski, S. (2007). Inductive logic programming in a nutshell. In Getoor, L., & Taskar, B. (Eds.),
Introduction to Statistical Relational Learning, pp. 57-92. MIT Press, Cambridge, MA.

Finley, T., & Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In
Proceedings of 25th International Conference on Machine Learning (ICML-2008), pp. 304-311,
Helsinki, Finland.

Fung, G. M., & Mangasarian, O. L. (2004). A feature selection Newton method for support vector machine
classification. Computational Optimization and Applications, 28(2), 185-202.

Getoor, L., & Taskar, B. (Eds.). (2007). Introduction to Statistical Relational Learning. MIT Press,
Cambridge, MA.

Heitz, G., Gould, S., Saxena, A., & Koller, D. (2008). Cascaded classification models: Combining models
for holistic scene understanding. In Koller, D., Schuurmans, D., Bengio, Y., & Bottou, L. (Eds.),
Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems,
Vancouver, British Columbia, Canada, December 8-11, 2008, pp. 641-648. MIT Press.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural
Computation, 14(8), 1771-1800.

Huynh, T. N., & Mooney, R. J. (2008). Discriminative structure and parameter learning for Markov logic
networks. In Proceedings of 25th International Conference on Machine Learning (ICML-2008), pp.
416-423, Helsinki, Finland.

Huynh, T. N., & Mooney, R. J. (2009). Max-margin weight learning for Markov logic networks. In
Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases
(ECML PKDD 2009), Part I, pp. 564-579.

Joachims,  T.  (2005).  A  support  vector  method  for  multivariate  performance  measures.  In  Proceedings  of 
22nd  International  Conference  on  Machine  Learning  (ICML-2005),  pp.  377-384. 

Joachims, T., Finley, T., & Yu, C.-N. (2009). Cutting-plane training of structural SVMs. Machine Learning.
http://www.springerlink.com/content/h557723w88185170.


Kautz, H., Selman, B., & Jiang, Y. (1997). A general stochastic approach to solving problems with hard and
soft constraints. In Dingzhu Gu, J. D., & Pardalos, P. (Eds.), The Satisfiability Problem: Theory and
Applications, pp. 573-586. American Mathematical Society.

Keshet, J., Shalev-Shwartz, S., Singer, Y., & Chazan, D. (2007). A large margin algorithm for
speech-to-phoneme and music-to-score alignment. IEEE Transactions on Audio, Speech & Language
Processing, 15(8), 2373-2382.

King, R. D., Sternberg, M. J. E., & Srinivasan, A. (1995). Relating chemical activity to structure: An
examination of ILP successes. New Generation Computing, 13(3,4), 411-433.

Kok, S., & Domingos, P. (2005). Learning the structure of Markov logic networks. In Proceedings of 22nd
International Conference on Machine Learning (ICML-2005), Bonn, Germany.

Kok, S., & Domingos, P. (2009). Learning Markov logic network structure via hypergraph lifting. In
Proceedings of the 26th International Conference on Machine Learning (ICML-2009), pp. 505-512,
Montreal, Quebec, Canada.

Kok, S., Singla, P., Richardson, M., & Domingos, P. (2005). The Alchemy system for statistical relational
AI. Tech. rep., Department of Computer Science and Engineering, University of Washington.
http://www.cs.washington.edu/ai/alchemy.

Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. In Proceedings of the Fifteenth National
Conference on Artificial Intelligence (AAAI-98), pp. 580-587, Madison, WI. AAAI Press / The MIT
Press.

Kumar, M. P., Kolmogorov, V., & Torr, P. H. S. (2009). An analysis of convex relaxations for MAP
estimation of discrete MRFs. Journal of Machine Learning Research, 10(Jan), 71-106.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. In Proceedings of 18th International Conference on Machine
Learning (ICML-2001), pp. 282-289, Williamstown, MA.

Landwehr, N., Kersting, K., & Raedt, L. D. (2007). Integrating Naive Bayes and FOIL. Journal of Machine
Learning Research, 8, 481-507.

Landwehr, N., Passerini, A., Raedt, L. D., & Frasconi, P. (2006). kFOIL: Learning simple relational kernels.
In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06).

Lee, S., Ganapathi, V., & Koller, D. (2007). Efficient structure learning of Markov networks using
L1-regularization. In Advances in Neural Information Processing Systems 19 (NIPS 2006), pp. 817-824.

Li, L.-J., & Li, F.-F. (2007). What, where and who? Classifying events by scene and object recognition. In
Proceedings of the 11th International Conference on Computer Vision (ICCV-2007), pp. 1-8.

Li, L.-J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: classification, annotation and
segmentation in an automatic framework. In Proceedings of the IEEE Computer Vision and Pattern
Recognition (CVPR).

Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization.
Mathematical Programming, 45(2), 503-528.

Lowd, D., & Domingos, P. (2007). Efficient weight learning for Markov logic networks. In Proceedings of
7th European Conference of Principles and Practice of Knowledge Discovery in Databases
(ECML-PKDD-2007), pp. 200-211.


Mihalkova, L., Huynh, T., & Mooney, R. J. (2007). Mapping and revising Markov logic networks for transfer
learning. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07), pp.
608-614, Vancouver, BC.

Mihalkova, L., & Mooney, R. J. (2007). Bottom-up learning of Markov logic network structure. In
Proceedings of 24th International Conference on Machine Learning (ICML-2007), Corvallis, OR.

Mooney,  R.  (1992).  Batch  versus  incremental  theory  refinement.  In  Proceedings  of  the  1992  AAAI  Spring 
Symposium  on  Knowledge  Assimilation. 

Muggleton,  S.  (2000).  Learning  stochastic  logic  programs.  In  Proceedings  of  the  AAAI2000  Workshop  on 
Learning  Statistical  Models  from  Relational  Data. 

Muggleton,  S.  (1995).  Inverse  entailment  and  Progol.  New  Generation  Computing,  13,  245-286. 

Nathan Ratliff, J. A. D. B., & Zinkevich, M. (2007). (Online) subgradient methods for structured
prediction. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
(AIStats).

Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of 21st
International Conference on Machine Learning (ICML-2004), pp. 78-85, Banff, Alberta, Canada.

Paes, A., Revoredo, K., Zaverucha, G., & Costa, V. S. (2005). Probabilistic first-order theory revision from
examples. In Proceedings of the 15th International Conference on Inductive Logic Programming
(ILP-2005), pp. 295-311, Bonn, Germany.

Paes, A., Zaverucha, G., & Costa, V. S. (2007). Revising first-order logic theories from examples through
stochastic local search. In Proceedings of the 17th International Conference on Inductive Logic
Programming (ILP-2007), pp. 200-210.

Poon, H., & Domingos, P. (2006). Sound and efficient inference with probabilistic and deterministic
dependencies. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06),
Boston, MA.

Poon,  H.,  &  Domingos,  P.  (2007).  Joint  inference  in  information  extraction.  In  Proceedings  of  the  Twenty- 
Second  Conference  on  Artificial  Intelligence  (AAAI-07),  pp.  913-918,  Vancouver,  British  Columbia, 
Canada. 

Poon, H., Domingos, P., & Sumner, M. (2008). A general method for reducing the complexity of relational
inference and its application to MCMC. In Proceedings of the 23rd AAAI Conference on Artificial
Intelligence (AAAI-08), pp. 1075-1080.

Revoredo,  K.,  &  Zaverucha,  G.  (2002).  Revision  of  first-order  Bayesian  classifiers.  In  Proceedings  of  the 
12th  International  Conference  on  Inductive  Logic  Programming  (ILP-2002),  pp.  223-237. 

Richards, B. L., & Mooney, R. J. (1995). Automated refinement of first-order Horn-clause domain theories.
Machine Learning, 19(2), 95-131.

Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine Learning, 62, 107-136.

Riedel, S. (2008). Improving the accuracy and efficiency of MAP inference for Markov logic. In
Proceedings of 24th Conference on Uncertainty in Artificial Intelligence (UAI-2008), pp. 468-475, Helsinki,
Finland.

Roth, D., & Yih, W.-t. (2002). Probabilistic reasoning for entity & relation recognition. In Proceedings of
the 19th international conference on Computational linguistics, pp. 1-7, Taipei, Taiwan.


Rückert, U., & Kramer, S. (2007). Margin-based first-order rule learning. Machine Learning, 70(2-3),
189-206.

Shalev-Shwartz, S. (2007). Online Learning: Theory, Algorithms, and Applications. Ph.D. thesis, The
Hebrew University of Jerusalem.

Singla, P., & Domingos, P. (2005). Discriminative training of Markov logic networks. In Proceedings of the
Twentieth National Conference on Artificial Intelligence (AAAI-05), pp. 868-873.

Singla, P., & Domingos, P. (2006). Memory-efficient inference in relational domains. In Proceedings of the
Twenty-First National Conference on Artificial Intelligence (AAAI-06).

Singla, P., & Domingos, P. (2008). Lifted first-order belief propagation. In Proceedings of the 23rd AAAI
Conference on Artificial Intelligence (AAAI-08), pp. 1094-1099, Chicago, Illinois, USA.

Slattery, S., & Craven, M. (1998). Combining statistical and relational methods for learning in hypertext
domains. In Page, D. (Ed.), Proceedings of the 8th International Workshop on Inductive Logic
Programming (ILP-98), pp. 38-52. Springer, Berlin.

Srinivasan, A. (2001). The Aleph manual.
http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.

Stancu-Minasian, I. (1997). Fractional Programming: Theory, Methods and Applications. Kluwer
Academic Publishers.

Szummer, M., Kohli, P., & Hoiem, D. (2008). Learning CRFs using graph cuts. In Proceedings of the 10th
European Conference on Computer Vision (ECCV’08), pp. 582-595, Marseille, France.
Springer-Verlag.

Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (2005). Learning structured prediction models:
a large margin approach. In Proceedings of 22nd International Conference on Machine Learning
(ICML-2005), pp. 896-903, Bonn, Germany. ACM.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. In Advances in Neural
Information Processing Systems 16 (NIPS 2003).

Taskar, B., Lacoste-Julien, S., & Jordan, M. I. (2006). Structured prediction, dual extragradient and Bregman
projections. Journal of Machine Learning Research, 7, 1627-1653.

Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. In
Proceedings of 18th Conference on Uncertainty in Artificial Intelligence (UAI-2002), pp. 485-492,
Edmonton, Canada.

Tran, S. D., & Davis, L. S. (2008). Event modeling and recognition using markov logic networks. In
Proceedings of the 10th European Conference on Computer Vision (ECCV), Marseille, France, October
12-18, pp. 610-623.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2004). Support vector machine learning for
interdependent and structured output spaces. In Proceedings of 21st International Conference on
Machine Learning (ICML-2004), pp. 104-112, Banff, Canada.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and
interdependent output variables. Journal of Machine Learning Research, 6, 1453-1484.

Werner, T. (2008). High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft
constraint optimisation (MAP-MRF). In Proceedings of the 2008 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR 2008). IEEE Computer Society.


Wrobel, S. (1996). First order theory refinement. In De Raedt, L. (Ed.), Advances in Inductive Logic
Programming, pp. 14-33. IOS Press, Amsterdam.

Yanover, C., & Weiss, Y. (2003). Finding the M most probable configurations in arbitrary graphical models.
In Thrun, S., Saul, L. K., & Scholkopf, B. (Eds.), Advances in Neural Information Processing Systems
16 (NIPS 2003). MIT Press.

Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm support vector machines. In Thrun, S., Saul,
L. K., & Scholkopf, B. (Eds.), Advances in Neural Information Processing Systems 16 (NIPS 2003),
pp. 49-56. MIT Press.

Zhu,  J.,  &  Xing,  E.  P.  (2009).  On  primal  and  dual  sparsity  of  Markov  networks.  In  Proceedings  of  the  26th 
International  Conference  on  Machine  Learning  (ICML-2009),  pp.  1265-1272,  Montreal,  Quebec, 
Canada. 

